Global Optimization Algorithms 
— Theory and Application — 




Evolutionary Algorithms 95 

Genetic Algorithms 141 

Genetic Programming 157 

Learning Classifier Systems 233 

Hill Climbing 253 

Simulated Annealing 263 

Example Applications 315 

Sigoa - Implementation in Java 439 

Background (Mathematics, Computer Science, ...) 455



Thomas Weise 
Version: 2009-06-26 



Newest Version: http://www.it-weise.de/



Preface 



This e-book is devoted to global optimization algorithms, which are methods to find opti- 
mal solutions for given problems. It especially focuses on Evolutionary Computation by dis- 
cussing evolutionary algorithms, genetic algorithms, Genetic Programming, Learning Classi- 
fier Systems, Evolution Strategy, Differential Evolution, Particle Swarm Optimization, and 
Ant Colony Optimization. It also elaborates on other metaheuristics like Simulated An- 
nealing, Extremal Optimization, Tabu Search, and Random Optimization. The book is no 
book in the conventional sense: Because of frequent updates and changes, it is not really 
intended for sequential reading but more as some sort of material collection, encyclopedia, 
or reference work where you can look up stuff, find the correct context, and are provided 
with fundamentals. 

With this book, two major audience groups are addressed: 

1. It can help students since we try to describe the algorithms in an understandable,
consistent way and, maybe even more important, include much of the background knowledge
needed to understand them. Thus, you can find summaries on stochastic theory and
theoretical computer science in Part IV on page 455. Additionally, application examples are
provided which give an idea how problems can be tackled with the different techniques 
and what results can be expected. 

2. Fellow researchers and PhD students may find the application examples helpful too. For 
them, in-depth discussions on the single methodologies are included that are supported 
with a large set of useful literature references. 

If this book contains something you want to cite or reference in your work, please use the 
citation suggestion provided in Chapter D on page 591. 

In order to maximize the utility of this electronic book, it contains automatic, clickable links. 
They are shaded with dark gray so the book is still b/w printable. You can click on 

1. entries in the table of contents, 

2. citation references like [916], 

3. page references like "95" , 

4. references such as "see Figure 2.1 on page 96" to sections, figures, tables, and listings, 
and 

5. URLs and links like "http://www.lania.mx/~ccoello/EMOO/ [accessed 2007-10-25]". 1

1 URLs are usually annotated with the date we have accessed them, like
http://www.lania.mx/~ccoello/EMOO/ [accessed 2007-10-25]. We can neither guarantee that
their content remains unchanged, nor that these sites stay available. We also assume no
responsibility for anything we linked to.

The following scenario is now, for example, possible: A student reads the text and finds a
passage that she wants to investigate in depth. She clicks on a citation that seems
interesting and the corresponding reference is shown. For some of the references which are
available online, links are provided in the reference text. When she clicks on such a link, the
Adobe Reader® 2 will open another window and load the corresponding document (or a browser
window of a site that links to the document). After reading it, the student may use the
"backwards" button in the navigation utility to go back to the text initially read in the e-book.

The contents of this book are divided into four parts. In the first part, different optimization
technologies are introduced and their features are described. Often, small examples are
given in order to ease understanding. In the second part, starting at page 315, we elaborate
on different application examples in detail. The Sigoa framework, one possible
implementation of optimization algorithms in Java, is discussed in Part III on page 439, where
we also show how some of the solutions to the previous problem instances can be realized.
Finally, in the last part, starting at page 455, the background knowledge for the rest of
the book is provided. Optimization is closely related to stochastics, and hence, an introduction
to this subject can be found there. Other important background information concerns theoretical
computer science and clustering algorithms.

However, this book is still being worked on. It is in a very preliminary phase where
major parts are still missing or under construction. Other sections or texts are incomplete 
(tagged with TODO). There may as well be errors in the contents or issues may be stated 
ambiguously (I do not have proof-readers). Additionally, the sequence of the content is not 
very good. Because of frequent updates, small sections may grow and become chapters, be 
moved to another place, merged with other sections, and so on. Thus, this book will change 
often. I choose to update, correct, and improve this book continuously instead of providing 
a new version each half year or so because I think this way it has a higher utility because 
it provides more information earlier. By doing so, I also risk confusing you with strange 
grammar and structure, so if you find something fishy, please let me know so I can correct 
and improve it right away. 

The updates and improvements will result in new versions of the book, which will regularly 
appear on the website http://www.it-weise.de/. The direct download link to the newest 
version of this book is http://www.it-weise.de/projects/book.pdf. The LaTeX source
code of this book, including all graphics and the bibliography, is available at
http://www.it-weise.de/projects/bookSource.zip. The source may not always be the one of the
most current version of the book. Compiling it requires multiple runs of BibTeX because of
the nifty way the references are incorporated.

I would be very happy if you provide feedback, report errors or missing things that you have 
found, criticize something, or have any additional ideas or suggestions. Do not hesitate to 
contact me via my email address tweise@gmx.de. 

As a matter of fact, a large number of people helped me to improve this book over time. I
have enumerated the most important contributors in Chapter C. Thank you guys, I really
appreciate your help!

Copyright © 2006-2009 Thomas Weise. 

Permission is granted to copy, distribute and/or modify this document under the terms 
of the GNU Free Documentation License, Version 1.2 or any later version published by 
the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no 
Back-Cover Texts. A copy of the license is included in the section entitled GNU Free 
Documentation License (FDL). You can find a copy of the GNU Free Documentation Li- 
cense in appendix Chapter A on page 575. 



2 The Adobe Reader® is available for download at http://www.adobe.com/products/reader/
[accessed 2007-08-13].

At many places in this book we refer to Wikipedia [2219], which is a great source of
knowledge. Wikipedia [2219] contains articles and definitions for many of the aspects discussed in
this book. Like this book, it is updated and improved frequently. Therefore, including the
links adds greatly to the book's utility, in my opinion.

Important Notice 

Be aware that this version of this book marks a point of transition from the first edition to 
the second one. Major fractions of the text of the first edition have not yet been revised and 
are, thus, not included in this document. However, I believe that this version corrects many
shortcomings as well as inconsistencies from the first edition and is better structured.

Global Optimization



1 Introduction



One of the most fundamental principles in our world is the search for an optimal state. 
It begins in the microcosm where atoms in physics try to form bonds 1 in order to minimize 
the energy of their electrons [1625]. When molecules form solid bodies during the process of 
freezing, they try to assume energy-optimal crystal structures. These processes, of course, 
are not driven by any higher intention but purely result from the laws of physics. 

The same goes for the biological principle of survival of the fittest [1940] which, together 
with the biological evolution [485] , leads to better adaptation of the species to their environ- 
ment. Here, a local optimum is a well-adapted species that dominates all other animals in 
its surroundings. Homo sapiens have reached this level, sharing it with ants, bacteria, flies, 
cockroaches, and all sorts of other creepy creatures. 

For as long as humankind has existed, we have striven for perfection in many areas. We want
to reach a maximum degree of happiness with the least amount of effort. In our economy, profit
and sales must be maximized and costs should be as low as possible. Therefore, optimization is
one of the oldest sciences, and it even extends into daily life [1519].

If something is important, general, and abstract enough, there is always a mathematical
discipline dealing with it. Global optimization 2 is the branch of applied mathematics and
numerical analysis that focuses on, well, optimization. The goal of global optimization is to find
the best possible elements $x^\star$ from a set $X$ according to a set of criteria $F = \{f_1, f_2, \dots, f_n\}$.
These criteria are expressed as mathematical functions 3 , the so-called objective functions.

Definition 1.1 (Objective Function). An objective function $f : X \mapsto Y$ with $Y \subseteq \mathbb{R}$ is
a mathematical function which is subject to optimization.

The codomain $Y$ of an objective function as well as its range must be a subset of the real
numbers ($Y \subseteq \mathbb{R}$). The domain $X$ of $f$ is called problem space and can represent any type
of elements like numbers, lists, construction plans, and so on. It is chosen according to the
problem to be solved with the optimization process. Objective functions are not necessarily
mere mathematical expressions, but can be complex algorithms that, for example, involve
multiple simulations. Global optimization comprises all techniques that can be used to find
the best elements $x^\star$ in $X$ with respect to such criteria $f \in F$.
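
As a minimal sketch of Definition 1.1, assume a toy problem space in which each element assigns four jobs to one of two machines. The job durations and the helper names `makespan` and `best_of` are hypothetical and chosen only for illustration; the point is merely that an objective function can be any computation that maps an element of $X$ to a real number.

```python
from itertools import product

# Hypothetical data: processing times of four jobs (not taken from the book).
JOB_DURATIONS = [4.0, 3.0, 2.0, 1.0]


def makespan(assignment):
    """Objective function f : X -> R, the finishing time of the busier machine.

    `assignment` is one element of the problem space X: a tuple that names
    the machine (0 or 1) on which each job runs.
    """
    loads = [0.0, 0.0]
    for duration, machine in zip(JOB_DURATIONS, assignment):
        loads[machine] += duration
    return max(loads)


def best_of(candidates, f):
    """Return the pair (x, f(x)) with the smallest objective value."""
    return min(((x, f(x)) for x in candidates), key=lambda pair: pair[1])


# The whole problem space: every way to place four jobs on two machines.
problem_space = list(product((0, 1), repeat=len(JOB_DURATIONS)))
x_best, f_best = best_of(problem_space, makespan)
print(x_best, f_best)
```

Enumerating all of $X$ is only possible here because the space has 16 elements; the techniques discussed in the rest of the text are meant for spaces where this is not feasible.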

1 http://en.wikipedia.org/wiki/Chemical_bond [accessed 2007-07-12]
2 http://en.wikipedia.org/wiki/Global_optimization [accessed 2007-07-03]
3 The concept of mathematical functions is outlined in set theory in Definition 27.27 on page 462.

In the remaining text of this introduction, we will first provide a rough classification
of the different optimization techniques which we will investigate in the further course of
this book (Section 1.1). In Section 1.2, we will outline how these best elements which we
are after can be defined. We will use Section 1.3 to shed some more light onto the meaning
and inter-relation of the symbols already mentioned ($f$, $F$, $x$, $x^\star$, $X$, $Y$, ...) and outline
the general structure of optimization processes. If optimization were a simple thing to do,
there wouldn't be a whole branch of mathematics with lots of cunning people dealing with
it. In Section 1.4 we will introduce the major problems that can be encountered during
optimization. We will discuss Formae as a general way of describing properties of possible
solutions in Section 1.5. In this book, we will provide additional hints that point to useful
literature, web links, conferences, and so on for all algorithms which we discuss. The first
of these information records, dealing with global optimization in general, can be found in
Section 1.6.

In the chapters to follow these introductory sections, different approaches to optimization 
are discussed, examples for the applications are given, and the mathematical foundation and 
background information is provided. 

1.1 A Classification of Optimization Algorithms 

In this book, we will only be able to discuss a small fraction of the wide variety of global
optimization techniques [1614]. Before digging any deeper into the matter, I will attempt to
provide a classification of these algorithms as an overview and discuss some basic use cases.

1.1.1 Classification According to Method of Operation 

Figure 1.1 sketches a rough taxonomy of global optimization methods. Generally,
optimization algorithms can be divided into two basic classes: deterministic and probabilistic
algorithms. Deterministic algorithms (see also Definition 30.11 on page 550) are most often used
if a clear relation between the characteristics of the possible solutions and their utility for a
given problem exists. Then, the search space can efficiently be explored using, for example, a
divide and conquer scheme 4 . If the relation between a solution candidate and its "fitness"
is not so obvious or too complicated, or the dimensionality of the search space is very high,
it becomes harder to solve a problem deterministically. Trying it would possibly result in
exhaustive enumeration of the search space, which is not feasible even for relatively small
problems.

Then, probabilistic algorithms 5 come into play. The initial work in this area, which has now
become one of the most important research fields in optimization, was started about 55 years
ago (see [1743, 750, 219], and [287]). An especially relevant family of probabilistic algorithms
are the Monte Carlo 6 -based approaches. They trade in guaranteed correctness of the solution
for a shorter runtime. This does not mean that the results obtained using them are incorrect;
they may just not be the global optima. On the other hand, a solution a little bit inferior
to the best possible one is better than one which needs $10^{100}$ years to be found.
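
The trade-off between runtime and guaranteed optimality can be illustrated with plain random sampling. This is only one very simple probabilistic scheme and not a method prescribed by the text; the objective function, the bounds, the budgets, and the fixed seed below are all assumptions made for the sake of a reproducible sketch.

```python
import random


def objective(x):
    """Hypothetical objective to be minimized; its global minimum is f(2) = 0."""
    return (x - 2.0) ** 2


def random_search(f, lower, upper, budget, seed=0):
    """Evaluate `budget` uniformly drawn points and keep the best one seen."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    best_x = rng.uniform(lower, upper)
    best_y = f(best_x)
    for _ in range(budget - 1):
        x = rng.uniform(lower, upper)
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


# A small budget finishes quickly but may be far from the optimum;
# a larger budget costs more evaluations and usually gets closer.
x_small, y_small = random_search(objective, -10.0, 10.0, budget=10)
x_large, y_large = random_search(objective, -10.0, 10.0, budget=10_000)
print(y_small, y_large)
```

Neither run is guaranteed to return the global optimum; the larger budget merely makes a close approximation more likely.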

Heuristics used in global optimization are functions that help decide which one of a set 
of possible solutions is to be examined next. On one hand, deterministic algorithms usually 
employ heuristics in order to define the processing order of the solution candidates. An 
example for such a strategy is informed search, as discussed in Section 17.4 on page 295.
Probabilistic methods, on the other hand, may only consider those elements of the search 
space in further computations that have been selected by the heuristic. 

Definition 1.2 (Heuristic). A heuristic 7 [1407, 1711, 1626] is a part of an optimization
algorithm that uses the information currently gathered by the algorithm to help decide
which solution candidate should be tested next or how the next individual can be produced.
Heuristics are usually problem class dependent.
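
The role of a heuristic in a deterministic search, namely defining the processing order of the solution candidates, can be sketched as follows. The candidate set, the goal test, and the "estimate" used as heuristic are hypothetical; the sketch only shows that a reasonable heuristic reduces the number of candidates that have to be tested.

```python
import heapq


def best_first_order(candidates, heuristic):
    """Yield candidates in the order suggested by the heuristic (lowest first).

    The original index serves as tie-breaker, so the order is deterministic.
    """
    heap = [(heuristic(c), index, c) for index, c in enumerate(candidates)]
    heapq.heapify(heap)
    while heap:
        _, _, candidate = heapq.heappop(heap)
        yield candidate


def evaluations_until_found(candidates, heuristic, is_goal):
    """Count how many candidates are tested before the goal is found."""
    ordered = best_first_order(candidates, heuristic)
    for count, candidate in enumerate(ordered, start=1):
        if is_goal(candidate):
            return count
    return None


candidates = list(range(100))
target = 73  # hypothetical solution, unknown to the search itself

# Informed: the heuristic assumes that good solutions lie near 70.
informed = evaluations_until_found(
    candidates, lambda c: abs(c - 70), lambda c: c == target)
# Uninformed: a constant heuristic degenerates to plain enumeration.
uninformed = evaluations_until_found(
    candidates, lambda c: 0, lambda c: c == target)
print(informed, uninformed)
```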

4 http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm [accessed 2007-07-09]

5 The common properties of probabilistic algorithms are specified in Definition 30.18 on page 552.

6 See Definition 30.20 on page 552 for an in-depth discussion of the Monte Carlo-type probabilistic
algorithms.

7 http://en.wikipedia.org/wiki/Heuristic_%28computer_science%29 [accessed 2007-07-03]

Figure 1.1: The taxonomy of global optimization algorithms.



Definition 1.3 (Metaheuristic). A metaheuristic 8 is a method for solving very general 
classes of problems. It combines objective functions or heuristics in an abstract and hopefully 
efficient way, usually without utilizing deeper insight into their structure, i. e., by treating 
them as black-box-procedures [813, 832, 233]. 

8 http://en.wikipedia.org/wiki/Metaheuristic [accessed 2007-07-03]

This combination is often performed stochastically by utilizing statistics obtained from
samples from the search space or based on a model of some natural phenomenon or physical
process. Simulated annealing, for example, decides which solution candidate is
evaluated next according to the Boltzmann probability factor of atom configurations of
solidifying metal melts. Evolutionary algorithms copy the behavior of natural evolution and
treat solution candidates as individuals that compete in a virtual environment. Unified
models of metaheuristic optimization procedures have been proposed by Vaessens et al.
[2087, 2088], Rayward-Smith [1710], Osman [1588], and Taillard et al. [1996].
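
The Boltzmann-based decision mentioned for simulated annealing is commonly written as the Metropolis acceptance rule. The sketch below uses that common formulation for a minimization problem and folds the Boltzmann constant into the temperature parameter; both are assumptions of this example, and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Probability of moving to a candidate whose objective value differs
    from the current one by `delta_e` (positive means worse).

    `temperature` must be greater than zero.
    """
    if delta_e <= 0.0:
        return 1.0  # improvements and equally good candidates are always taken
    return math.exp(-delta_e / temperature)


def accept(delta_e, temperature, rng):
    """Draw the random decision whether the candidate is accepted."""
    return rng.random() < acceptance_probability(delta_e, temperature)


rng = random.Random(0)
# At a high temperature a worse candidate is accepted more readily
# than at a low temperature.
print(acceptance_probability(1.0, 10.0), acceptance_probability(1.0, 0.1))
print(accept(-0.5, 1.0, rng))
```

Treating the objective function as a black box is visible here: the rule only needs the difference of two objective values, not any insight into how they were computed.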

An important class of probabilistic Monte Carlo metaheuristics is Evolutionary Compu- 
tation 9 . It encompasses all algorithms that are based on a set of multiple solution candidates 
(called population) which are iteratively refined. This field of optimization is also a class 
of Soft Computing 10 as well as a part of the artificial intelligence 11 area. Some of its most 
important members are evolutionary algorithms and Swarm Intelligence, which will be dis- 
cussed in-depth in this book. Besides these nature-inspired and evolutionary approaches, 
there exist also methods that copy physical processes like the before-mentioned Simulated 
Annealing, Parallel Tempering, and Raindrop Method, as well as techniques without direct 
real-world role model like Tabu Search and Random Optimization. As a preview of what can 
be found in this book, we have marked the techniques that will be discussed with a thicker 
border in Figure 1.1. 

1.1.2 Classification According to Properties 

The taxonomy just introduced classifies the optimization methods according to their algo- 
rithmic structure and underlying principles, in other words, from the viewpoint of theory. A 
software engineer or a user who wants to solve a problem with such an approach is however 
more interested in its "interfacing features" such as speed and precision. 

Speed and precision are conflicting objectives, at least in terms of probabilistic algo- 
rithms. A general rule of thumb is that you can gain improvements in accuracy of opti- 
mization only by investing more time. Scientists in the area of global optimization try to 
push this Pareto frontier 12 further by inventing new approaches and enhancing or tweaking 
existing ones. 

Optimization Speed 

When it comes to time constraints and hence, the required speed of the optimization algo- 
rithm, we can distinguish two main types of optimization use cases. 

Definition 1.4 (Online Optimization). Online optimization problems are tasks that need
to be solved quickly, in a time span between ten milliseconds and a few minutes. In order to
find a solution in this short time, optimality is normally traded in for speed gains.

Examples for online optimization are robot localization, load balancing, service
composition for business processes (see for example Section 22.2.1 on page 384), or updating
a factory's machine job schedule after new orders have come in. From the examples, it becomes
clear that online optimization tasks are often carried out repetitively - new orders will, for 
instance, continuously arrive in a production facility and need to be scheduled to machines 
in a way that minimizes the waiting time of all jobs. 

Definition 1.5 (Offline Optimization). In offline optimization problems, time is not so 
important and a user is willing to wait maybe even days if she can get an optimal or close- 
to-optimal result. 

Such problems include, for example, design optimization, data mining (see for
instance Section 22.1 on page 373), or creating long-term schedules for transportation crews.
These optimization processes will usually be carried out only once in a long time. 

Before doing anything else, one must be sure which of these two classes the
problem to be solved belongs to.



9 http://en.wikipedia.org/wiki/Evolutionary_computation [accessed 2007-09-17] 

10 http://en.wikipedia.org/wiki/Soft_computing [accessed 2007-09-17] 

11 http://en.wikipedia.org/wiki/Artificial_intelligence [accessed 2007-09-17]

12 Pareto frontiers will be discussed in Section 1.2.2 on page 31.

Number of Criteria

Optimization algorithms can be divided into those that try to find the best values of single
objective functions $f$ and those that optimize sets $F$ of target functions. This distinction
between single-objective optimization and multi-objective optimization is discussed in depth
in Section 1.2.2.



1.2 What is an optimum? 

We have already said that global optimization is about finding the best possible solutions 
for given problems. Thus, it cannot be a bad idea to start out by discussing what it is that 
makes a solution optimal 13 .

1.2.1 Single Objective Functions 

In the case of optimizing a single criterion $f$, an optimum is either its maximum or minimum,
depending on what we are looking for. If we own a manufacturing plant and have to assign
incoming orders to machines, we will do this in a way that minimizes the time needed
to complete them. On the other hand, we will arrange the purchase of raw material, the
employment of staff, and the placing of commercials in a way that maximizes our profit. In
global optimization, it is a convention that optimization problems are most often defined
as minimizations and if a criterion $f$ is subject to maximization, we simply minimize its
negation ($-f$).
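
The convention of expressing maximization as the minimization of the negated criterion can be sketched in a few lines. The profit curve and the candidate range are hypothetical and serve only as an illustration.

```python
def minimize_over(candidates, f):
    """Return the candidate with the smallest objective value."""
    return min(candidates, key=f)


def maximize_over(candidates, f):
    """Maximize f by minimizing its negation (-f)."""
    return minimize_over(candidates, lambda x: -f(x))


def profit(units):
    """Hypothetical profit curve that peaks at 30 produced units."""
    return 900 - (units - 30) ** 2


best_units = maximize_over(range(0, 61), profit)
print(best_units, profit(best_units))
```

With this wrapper, an optimizer only ever needs to implement minimization.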

Figure 1.2 illustrates such a function $f$ defined over a two-dimensional space $X =
(X_1, X_2)$. As outlined in this graphic, we distinguish between local and global optima. A
global optimum is an optimum of the whole domain $X$ while a local optimum is an optimum
of only a subset of $X$.

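Definition 1.6 (Local Maximum). A (local) maximum $\hat{x}_l \in X$ of one (objective) function
$f : X \mapsto \mathbb{R}$ is an input element with $f(\hat{x}_l) \geq f(x)$ for all $x$ neighboring $\hat{x}_l$.

If $X \subseteq \mathbb{R}^n$, we can write:

$\forall \hat{x}_l \; \exists \varepsilon > 0 : f(\hat{x}_l) \geq f(x) \; \forall x \in X, \; |x - \hat{x}_l| < \varepsilon$ (1.1)

Definition 1.7 (Local Minimum). A (local) minimum $\check{x}_l \in X$ of one (objective) function
$f : X \mapsto \mathbb{R}$ is an input element with $f(\check{x}_l) \leq f(x)$ for all $x$ neighboring $\check{x}_l$.

If $X \subseteq \mathbb{R}$, we can write:

$\forall \check{x}_l \; \exists \varepsilon > 0 : f(\check{x}_l) \leq f(x) \; \forall x \in X, \; |x - \check{x}_l| < \varepsilon$ (1.2)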

Definition 1.8 (Local Optimum). A (local) optimum $x^\star_l \in X$ of one (objective) function
$f : X \mapsto \mathbb{R}$ is either a local maximum or a local minimum.

Definition 1.9 (Global Maximum). A global maximum $\hat{x} \in X$ of one (objective) function
$f : X \mapsto \mathbb{R}$ is an input element with $f(\hat{x}) \geq f(x) \; \forall x \in X$.

Definition 1.10 (Global Minimum). A global minimum $\check{x} \in X$ of one (objective) function
$f : X \mapsto \mathbb{R}$ is an input element with $f(\check{x}) \leq f(x) \; \forall x \in X$.

13 http://en.wikipedia.org/wiki/Maxima_and_minima [accessed 2007-07-03]

Figure 1.2: Global and local optima of a two-dimensional function.



Definition 1.11 (Global Optimum). A global optimum $x^\star \in X$ of one (objective) function
$f : X \mapsto \mathbb{R}$ is either a global maximum or a global minimum.

Even a one-dimensional function $f : X = \mathbb{R} \mapsto \mathbb{R}$ may have more than one global
maximum, multiple global minima, or even both in its domain $X$. Take the cosine function
for example: It has global maxima $\hat{x}_i$ at $\hat{x}_i = 2i\pi$ and global minima $\check{x}_i$ at $\check{x}_i = (2i + 1)\pi$ for
all $i \in \mathbb{Z}$. The correct solution of such an optimization problem would then be a set $X^\star$ of
all optimal inputs in $X$ rather than a single maximum or minimum. Furthermore, the exact
meaning of optimal is problem dependent. In single-objective optimization, it either means
minimum or maximum. In multi-objective optimization, there exist a variety of approaches
to define optima which we will discuss in-depth in Section 1.2.2.
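
The definitions of local and global minima can be checked on a sampled one-dimensional function, where "neighboring" simply means the adjacent grid points. The sample values are hypothetical; the cosine values at the end only confirm numerically the locations of the maxima and minima stated in the text.

```python
import math


def local_minima(values):
    """Indices i with values[i] <= every existing neighbor (1-D grid)."""
    result = []
    for i, v in enumerate(values):
        left_ok = i == 0 or v <= values[i - 1]
        right_ok = i == len(values) - 1 or v <= values[i + 1]
        if left_ok and right_ok:
            result.append(i)
    return result


def global_minima(values):
    """Indices of all elements that share the lowest value."""
    lowest = min(values)
    return [i for i, v in enumerate(values) if v == lowest]


samples = [3, 1, 2, 0, 4, 0, 5]  # hypothetical objective values on a grid
print(local_minima(samples))   # every global minimum is also a local one
print(global_minima(samples))  # more than one global minimum exists here

# The cosine example: maxima at 2*i*pi, minima at (2*i + 1)*pi.
cos_maxima = [math.cos(2 * i * math.pi) for i in range(-3, 4)]
cos_minima = [math.cos((2 * i + 1) * math.pi) for i in range(-3, 4)]
```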

Definition 1.12 (Optimal Set). The optimal set $X^\star$ is the set that contains all optimal
elements.

There are normally multiple, often even infinitely many optimal solutions. Since the
memory of our computers is limited, we can find only a finite (sub-)set of them. We thus
distinguish between the global optimal set $X^\star$ and the set $X^*$ of (seemingly optimal) elements
which an optimizer returns. The tasks of global optimization algorithms are

1. to find solutions that are as good as possible and 

2. that are also widely different from each other [534]. 

The second goal becomes obvious if we assume that we have an objective function
$f : \mathbb{R} \mapsto \mathbb{R}$ which is optimal for all $x \in [0, 10] \Leftrightarrow x \in X^\star$. This interval contains uncountably
many solutions, and an optimization algorithm may yield $X_1^* = \{0, 0.1, 0.11, 0.05, 0.01\}$ or
$X_2^* = \{0, 2.5, 5, 7.5, 10\}$ as result. Both sets only represent a small subset of the possible
solutions. The second result ($X_2^*$), however, gives us a broader view on the optimal set.
Even good optimization algorithms do not necessarily find the real global optima but may
only be able to approximate them. In other words, $X_3^* = \{-0.3, 5, 7.5, 11\}$ is also a possible
result of the optimization process, although containing two sub-optimal elements.
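
The difference between the three result sets can be made tangible with a small sketch. The `spread` function is only one simple, assumed way to quantify how widely the returned solutions differ; the text does not prescribe a particular diversity measure.

```python
def is_optimal(x):
    """Membership test for the global optimal set of the example, [0, 10]."""
    return 0.0 <= x <= 10.0


def spread(solutions):
    """Illustrative diversity measure: width of the covered interval."""
    return max(solutions) - min(solutions)


X1 = [0, 0.1, 0.11, 0.05, 0.01]
X2 = [0, 2.5, 5, 7.5, 10]
X3 = [-0.3, 5, 7.5, 11]

print(spread(X1), spread(X2))                 # X2 covers far more of [0, 10]
print([x for x in X3 if not is_optimal(x)])   # the two sub-optimal elements
```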

In Chapter 19 on page 307, we will introduce different algorithms and approaches that
can be used to maintain an optimal set or to select the optimal elements from a given set
during an optimization process.

1.2.2 Multiple Objective Functions

Global optimization techniques are not just used for finding the maxima or minima of single
functions $f$. In many real-world design or decision making problems, they are rather applied
to sets $F$ consisting of $n = |F|$ objective functions $f_i$, each representing one criterion to be
optimized [537, 360, 716].

$F = \{f_i : X \mapsto Y_i : 0 < i \leq n, \; Y_i \subseteq \mathbb{R}\}$ (1.3)

Algorithms designed to optimize such sets of objective functions are usually named with
the prefix multi-objective, like multi-objective evolutionary algorithms, which are discussed
in Definition 2.2 on page 96.

Examples 

Factory Example 

Multi-objective optimization often means compromising between conflicting goals. If we go
back to our factory example, we can specify the following objectives that are all subject to
optimization:

1. Minimize the time between an incoming order and the shipment of the corresponding 
product. 

2. Maximize profit. 

3. Minimize costs for advertising, personnel, raw materials, etc.

4. Maximize product quality. 

5. Minimize negative impact on environment. 

The last two objectives clearly seem to contradict the cost minimization. Between the
personnel costs, the time needed for production, and the product quality there should also be
some kind of (conflicting) relation. The exact mutual influences between objectives can
apparently become complicated and are not always obvious.

Artificial Ant Example 

Another example for such a situation is the Artificial Ant problem 14 where the goal is to 
find the most efficient controller for a simulated ant. The efficiency of an ant should not only 
be measured by the amount of food it is able to pile. For every food item, the ant needs 
to walk to some point. The more food it piles, the longer the distance it needs to walk. If 
its behavior is driven by a clever program, it may walk along a shorter route which would 
not be discovered by an ant with a clumsy controller. Thus, the distance it has to cover 
to find the food or the time it needs to do so may also be considered in the optimization 
process. If two control programs produce the same results and one is smaller (i. e., contains 
fewer instructions) than the other, the smaller one should be preferred. Like in the factory 
example, the optimization goals conflict with each other. 

From both of these examples, we can gain another insight: To find the global optimum
could mean to maximize one function $f_i \in F$ and to minimize another one $f_j \in F$ ($i \neq j$).
Hence, it makes no sense to talk about a global maximum or a global minimum in terms
of multi-objective optimization. We will thus retreat to the notion of the set of optimal
elements $x^\star \in X^\star \subseteq X$.

Since compromises for conflicting criteria can be defined in many ways, there exist
multiple approaches to define what an optimum is. These different definitions, in turn, lead to
different sets $X^\star$.




Figure 1.3: Two functions $f_1$ and $f_2$ with different maxima $\hat{x}_1$ and $\hat{x}_2$.



Graphical Example 1 

We will discuss some of these approaches in the following by using two graphical examples
for illustration purposes. In the first example pictured in Figure 1.3, we want to maximize
two independent objective functions $F_1 = \{f_1, f_2\}$. Both objective functions have the real
numbers $\mathbb{R}$ as problem space $X_1$. The maximum (and thus, the optimum) of $f_1$ is $\hat{x}_1$ and
the largest value of $f_2$ is at $\hat{x}_2$. In Figure 1.3, we can easily see that $f_1$ and $f_2$ are partly
conflicting: Their maxima are at different locations and there even exist areas where $f_1$ rises
while $f_2$ falls and vice versa.

Graphical Example 2 




Figure 1.4: Two functions $f_3$ and $f_4$ with different minima $\check{x}_1$, $\check{x}_2$, $\check{x}_3$, and $\check{x}_4$.



The objective functions $f_1$ and $f_2$ in the first example are mappings of a one-dimensional
problem space $X_1$ to the real numbers that are to be maximized. In the second example,
sketched in Figure 1.4, we instead minimize two functions $f_3$ and $f_4$ that map a
two-dimensional problem space $X_2 \subseteq \mathbb{R}^2$ to the real numbers $\mathbb{R}$. Both functions have two global
minima; the lowest values of $f_3$ are $\check{x}_1$ and $\check{x}_2$ whereas $f_4$ gets minimal at $\check{x}_3$ and $\check{x}_4$. It
should be noted that $\check{x}_1 \neq \check{x}_2 \neq \check{x}_3 \neq \check{x}_4$.



14 See Section 21.3.1 on page 354 for more details.




Weighted Sums (Linear Aggregation) 

The simplest method to define what is optimal is computing a weighted sum $g(x)$ of all
the functions $f_i(x) \in F$. 15 Each objective $f_i$ is multiplied with a weight $w_i$ representing its
importance. Using signed weights also allows us to minimize one objective and to maximize
another. We can, for instance, apply a weight $w_a = 1$ to an objective function $f_a$ and the
weight $w_b = -1$ to the criterion $f_b$. By minimizing $g(x)$, we then actually minimize the
first and maximize the second objective function. If we instead maximize $g(x)$, the effect
would be converse and $f_b$ would be minimized and $f_a$ would be maximized. Either way,
multi-objective problems are reduced to single-objective ones by this method.



$g(x) = \sum_{i=1}^{n} w_i f_i(x) = \sum_{\forall f_i \in F} w_i f_i(x)$ (1.4)
$x^\star \in X^\star \Leftrightarrow g(x^\star) \geq g(x) \; \forall x \in X$ (1.5)
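
The weighted sum of Equation 1.4 can be sketched directly. The two objective functions and the candidate set below are hypothetical; the weights $w_a = 1$ and $w_b = -1$ follow the signed-weight example from the text, so that minimizing $g$ minimizes the first and maximizes the second objective.

```python
def weighted_sum(objectives, weights):
    """Build g(x) = sum_i w_i * f_i(x) from objectives and constant weights."""
    def g(x):
        return sum(w * f(x) for f, w in zip(objectives, weights))
    return g


def f_a(x):
    """Hypothetical objective to be minimized (smallest at x = 0)."""
    return x ** 2


def f_b(x):
    """Hypothetical objective to be maximized (largest at x = 2)."""
    return -(x - 2) ** 2


g = weighted_sum([f_a, f_b], [1, -1])
candidates = range(-5, 6)
x_best = min(candidates, key=g)  # minimize the aggregated single objective
print(x_best, g(x_best))
```

The result lies between the individual optima of the two objectives, which is the kind of compromise the aggregation produces when both weights have the same magnitude.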

Graphical Example 1 

Figure 1.5 demonstrates optimization with the weighted sum approach for the example given
in Section 1.2.2. The weights are both set to $1 = w_1 = w_2$. If we maximize $g_1(x)$, we will
thus also maximize the functions $f_1$ and $f_2$. This leads to a single optimum $x^\star = \hat{x}$.



Figure 1.5: Optimization using the weighted sum approach (first example).



Graphical Example 2 

The sum of the two-dimensional functions $f_3$ and $f_4$ from the second graphical example
given in Section 1.2.2 is sketched in Figure 1.6. Again we set the weights $w_3$ and $w_4$ to 1.
The sum $g_2$, however, is subject to minimization. The graph of $g_2$ has two especially deep
valleys. At the bottoms of these valleys, the two global minima $\check{x}_5$ and $\check{x}_6$ can be found.

Problems with Weighted Sums 

The drawback of this approach is that it cannot handle functions that rise or fall with 
different speed 1 '' properly. In Figure 1.7, we have sketched the sum g(x) of the two objective 
functions f\{x) — —x 2 and fa(x) = e x ~ 2 . When minimizing or maximizing this sum, we 

15 This approach applies a linear aggregation function for fitness assignment and is therefore also 
often referred to as linear aggregating. 

16 See Section 30.1.3 on page 550 

or http://en.wikipedia.org/wiki/Asymptotic_notation [accessed 2007-07-03] for related informa- 
tion. 
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Figure 1.6: Optimization using the weighted sum approach (second example). 
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y=g(x)=f 1 (x)+f 2 (x) 

y 2 =f 2 (x) 




Figure 1.7: A problematic constellation for the weighted sum approach. 



will always disregard one of the two functions, depending on the interval chosen. For small
$x$, $f_2$ is negligible compared to $f_1$. For $x > 5$ it begins to outpace $f_1$ which, in turn, will
now become negligible. Such functions cannot be added up properly using constant weights.
Even if we set $w_1$ to the really large number $10^{10}$, $f_1$ will become insignificant for
all $x > 40$, because $\left| \frac{-(40^2) \cdot 10^{10}}{e^{40-2}} \right| \approx 0.0005$. Therefore, weighted sums are only suitable
to optimize functions that at least share the same big-O notation (see Section 30.1.3 on
page 550). Often, it is not obvious how the objective functions will fall or rise. How can we,
for instance, determine whether the objective maximizing the food piled by an Artificial Ant
rises in comparison to the objective minimizing the distance walked by the simulated insect?
And even if the shape of the objective functions and their complexity class were clear, the
question about how to set the weights $w$ properly still remains open in most cases [487]. In
the same paper, Das and Dennis [487] also show that with weighted sum approaches, not
necessarily all elements considered optimal in terms of Pareto domination will be found.
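
The imbalance described above can be checked numerically. The following sketch uses the two functions from the text, $f_1(x) = -x^2$ and $f_2(x) = e^{x-2}$, and the weight $10^{10}$ for $f_1$; the helper name `ratio` is an assumption introduced here for illustration.

```python
import math

W1 = 1e10  # the large weight for f1 discussed in the text


def f1(x):
    return -x ** 2


def f2(x):
    return math.exp(x - 2)


def ratio(x):
    """|w1 * f1(x)| / f2(x): how much the weighted f1 still matters next to f2."""
    return abs(W1 * f1(x)) / f2(x)


for x in (5, 20, 40):
    print(x, ratio(x))
```

At $x = 40$ the ratio is about 0.0005, so even the heavily weighted quadratic term contributes almost nothing to the sum, while at $x = 5$ it still dominates by many orders of magnitude.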

Pareto Optimization

The mathematical foundations for multi-objective optimization which considers conflicting
criteria in a fair way were laid by Vilfredo Pareto [1615] 110 years ago [1225]. Pareto
optimality¹⁷ became an important notion in economics, game theory, engineering, and social
sciences [390, 2219, 1587, 752]. It defines the frontier of solutions that can be reached by
trading-off conflicting objectives in an optimal manner. From this front, a decision maker
(be it a human or an algorithm) can finally choose the configurations that, in his opinion,
suit best [715, 716, 375, 1961, 877, 760, 177]. The notion of optimality in the Pareto sense is
strongly based on the definition of domination:

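Definition 1.13 (Domination). An element $x_1$ dominates (is preferred to) an element
$x_2$ ($x_1 \vdash x_2$) if $x_1$ is better than $x_2$ in at least one objective function and not worse with
respect to all other objectives. Based on the set $F$ of objective functions $f$, we can write:

$$x_1 \vdash x_2 \Leftrightarrow \forall i : 0 < i \leq n \Rightarrow \omega_i f_i(x_1) \leq \omega_i f_i(x_2) \;\wedge\; \exists j : 0 < j \leq n : \omega_j f_j(x_1) < \omega_j f_j(x_2) \tag{1.6}$$
$$\omega_i = \begin{cases} 1 & \text{if } f_i \text{ should be minimized} \\ -1 & \text{if } f_i \text{ should be maximized} \end{cases} \tag{1.7}$$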

Different from the weights in the weighted sum approach, the factors $\omega_i$ only carry
sign information which allows us to maximize some objectives and to minimize some other
criteria.
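
The domination test of Definition 1.13 translates almost literally into code. The sketch below assumes that each solution candidate has already been evaluated, so that the function receives two tuples of objective values rather than the candidates themselves; the name `dominates` and this calling convention are choices made here for illustration.

```python
def dominates(y1, y2, omega):
    """Return True if objective vector y1 dominates y2.

    omega[i] is 1 if objective i is minimized and -1 if it is maximized.
    """
    # Not worse in any objective ...
    not_worse = all(w * a <= w * b for a, b, w in zip(y1, y2, omega))
    # ... and strictly better in at least one.
    better = any(w * a < w * b for a, b, w in zip(y1, y2, omega))
    return not_worse and better


maximize_both = (-1, -1)
print(dominates((2, 3), (1, 3), maximize_both))  # better in the first objective
print(dominates((2, 1), (1, 3), maximize_both))  # a trade-off, no domination
```

Note that an element never dominates itself, and that two elements which trade one objective against the other are incomparable in both directions.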

The Pareto domination relation defines a strict partial order (see Definition 27.31 on
page 463) on the space of possible objective values. In contrast, the weighted sum approach
imposes a total order by projecting it into the real numbers $\mathbb{R}$.

Definition 1.14 (Pareto Optimal). An element $x^* \in X$ is Pareto optimal (and hence,
part of the optimal set $X^*$) if it is not dominated by any other element in the problem space
$X$. In terms of Pareto optimization, $X^*$ is called the Pareto set or the Pareto Frontier.

$$x^* \in X^* \Leftrightarrow \nexists x \in X : x \vdash x^* \tag{1.8}$$

Graphical Example 1 

In Figure 1.8, we illustrate the impact of the definition of Pareto optimality on our first
example (outlined in Section 1.2.2). We assume again that $f_1$ and $f_2$ should both be maximized
and hence, $\omega_1 = \omega_2 = -1$. The areas shaded with dark gray are Pareto optimal and
thus, represent the optimal set $X^* = [x_2, x_3] \cup [x_5, x_6]$ which here contains infinitely many
elements¹⁸. All other points are dominated, i. e., not optimal.

The points in the area between $x_1$ and $x_2$ (shaded in light gray) are dominated by other
points in the same region or in $[x_2, x_3]$, since both functions $f_1$ and $f_2$ can be improved by
increasing $x$. If we start at the leftmost point in $X$ (which is position $x_1$), for instance, we
can go one small step $\Delta$ to the right and will find a point $x_1 + \Delta$ dominating $x_1$ because
$f_1(x_1 + \Delta) > f_1(x_1)$ and $f_2(x_1 + \Delta) > f_2(x_1)$. We can repeat this procedure and will always
find a new dominating point until we reach $x_2$. $x_2$ marks the global maximum of $f_2$, the
point with the highest possible $f_2$ value, which cannot be dominated by any other point in
$X$ by definition (see Equation 1.6).

From here on, $f_2$ will decrease for a while, but $f_1$ keeps rising. If we now go a small step
$\Delta$ to the right, we will find a point $x_2 + \Delta$ with $f_2(x_2 + \Delta) < f_2(x_2)$ but also $f_1(x_2 + \Delta) >
f_1(x_2)$. One objective can only get better if another one degenerates. In order to increase $f_1$,
$f_2$ would be decreased and vice versa and so the new point is not dominated by $x_2$. Although

17 http://en.wikipedia.org/wiki/Pareto_efficiency [accessed 2007-07-03] 

18 In practice, of course, our computers can only handle finitely many elements 
some of the $f_2(x)$ values of the other points $x \in [x_1, x_2)$ may be larger than $f_2(x_2 + \Delta)$,
$f_1(x_2 + \Delta) > f_1(x)$ holds for all of them. This means that no point in $[x_1, x_2)$ can dominate
any point in $[x_2, x_4)$ because $f_1$ keeps rising until $x_4$ is reached.

At $x_3$ however, $f_2$ steeply falls to a very low level, a level lower than $f_2(x_5)$. Since the $f_1$
values of the points in $[x_5, x_6]$ are also higher than those of the points in $(x_3, x_4]$, all points
in the set $[x_5, x_6]$ (which also contains the global maximum of $f_1$) dominate those in $(x_3, x_4]$.
For all the points in the white area between $x_4$ and $x_5$ and after $x_6$, we can derive similar
relations. All of them are also dominated by the non-dominated regions that we have just
discussed.



Graphical Example 2 

Another method to visualize the Pareto relationship is outlined in Figure 1.9 for our second
graphical example. For a certain resolution of the problem space $X_2$, we have counted the
number of elements that dominate each element $x \in X_2$. The higher this number, the
worse is the element $x$ in terms of Pareto optimization. Hence, those solution candidates
residing in the valleys of Figure 1.9 are better than those which are part of the hills. This
Pareto ranking approach is also used in many optimization algorithms as part of the fitness
assignment scheme (see Section 2.3.3 on page 112, for instance). A non-dominated element
is, as the name says, not dominated by any other solution candidate. These elements are
Pareto optimal and have a domination-count of zero. In Figure 1.9, there are four such areas
$X^*_1$, $X^*_2$, $X^*_3$, and $X^*_4$.
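
The domination count used for this kind of plot can be sketched as follows. The sample points are hypothetical objective vectors (both objectives minimized) and are not the data behind Figure 1.9; the function names are likewise assumptions made for illustration.

```python
def dominates(y1, y2, omega):
    """True if objective vector y1 dominates y2 (omega: 1 = minimize, -1 = maximize)."""
    not_worse = all(w * a <= w * b for a, b, w in zip(y1, y2, omega))
    better = any(w * a < w * b for a, b, w in zip(y1, y2, omega))
    return not_worse and better


def domination_counts(points, omega):
    """For each point, count how many other points dominate it."""
    return [sum(dominates(q, p, omega) for q in points) for p in points]


# Hypothetical objective vectors, both objectives subject to minimization.
points = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
counts = domination_counts(points, (1, 1))
pareto_optimal = [p for p, c in zip(points, counts) if c == 0]
print(counts, pareto_optimal)
```

Elements with a count of zero form the non-dominated set; larger counts correspond to the hills of the landscape described above. This direct approach needs a quadratic number of comparisons, which is acceptable for a sketch but not necessarily for large populations.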




Figure 1.9: Optimization using the Pareto Frontier approach (second example). 

If we compare Figure 1.9 with the plots of the two functions $f_3$ and $f_4$ in Figure 1.4, we
can see that hills in the domination space occur at positions where both $f_3$ and $f_4$ have high
values. Conversely, regions of the problem space where both functions have small values are
dominated by very few elements.

Besides these examples here, another illustration of the domination relation which may 
help understanding Pareto optimization can be found in Section 2.3.3 on page 112 (Figure 2.4 
and Table 2.1). 

Problems of Pure Pareto Optimization 

The complete Pareto optimal set is often not the wanted result of an optimization algorithm. 
Usually, we are rather interested in some special areas of the Pareto front only. 

Artificial Ant Example We can again take the Artificial Ant example to visualize this prob- 
lem. In Section 1.2.2 on page 27 we have introduced multiple conflicting criteria in this 
problem. 

1. Maximize the amount of food piled. 

2. Minimize the distance covered or the time needed to find the food. 

3. Minimize the size of the program driving the ant. 

Pareto optimization may now yield for example: 

1. A program consisting of 100 instructions, allowing the ant to gather 50 food items when 
walking a distance of 500 length units. 

2. A program consisting of 100 instructions, allowing the ant to gather 60 food items when 
walking a distance of 5000 length units. 

3. A program consisting of 10 instructions, allowing the ant to gather 1 food item when 
walking a distance of 5 length units. 

4. A program consisting of 0 instructions, allowing the ant to gather 0 food items when
walking a distance of 0 length units.

The result of the optimization process obviously contains two useless but non-dominated 
individuals which occupy space in the population and the non-dominated set. We also invest 
processing time in evaluating them, and even worse, they may dominate solutions that are 
not optimal but fall into the space behind the interesting part of the Pareto front. Further- 
more, memory restrictions usually force us to limit the size of the list of non-dominated 
solutions found during the search. When this size limit is reached, some optimization al- 
gorithms use a clustering technique to prune the optimal set while maintaining diversity. 
On one hand, this is good since it will preserve a broad scan of the Pareto frontier. On the
other hand, a short but dumb program is of course very different from a longer,
intelligent one. Therefore, it will be kept in the list and other solutions which differ less from
each other but are more interesting for us will be discarded.

Furthermore, non-dominated elements have a higher probability of being explored fur- 
ther. This then leads inevitably to the creation of a great proportion of useless offspring. In 
the next generation, these useless offspring will need a good share of the processing time to 
be evaluated. 

Thus, there are several reasons to force the optimization process into a wanted direction. 
In Section 22.2.2 on page 390 you can find an illustrative discussion on the drawbacks of 
strict Pareto optimization in a practical example (evolving web service compositions). 

1.2.3 Constraint Handling 

Such a region of interest is one of the reasons for one further extension of the definition of
optimization problems: In many scenarios, $p$ inequality constraints $g$ and $q$ equality constraints
$h$ may be imposed in addition to the objective functions. Then, a solution candidate $x$ is
feasible if and only if $g_i(x) \geq 0 \; \forall i = 1, 2, .., p$ and $h_i(x) = 0 \; \forall i = 1, 2, .., q$ holds. Obviously, only
a feasible individual can be a solution, i. e., an optimum, for a given optimization problem.
Comprehensive reviews on techniques for such problems have been provided by Michalewicz
[1406], Michalewicz and Schoenauer [1410], Coello Coello [358], and Coello Coello et al. [361]
in the context of Evolutionary Computation.

Death Penalty 

Probably the easiest way of dealing with constraints is to simply reject all infeasible solution
candidates right away and not consider them any further in the optimization process.
This death penalty [1406, 1408] can only work in problems where the feasible regions are
very large; otherwise, it will lead the search to stagnate. Also, the
information which could be gained from the infeasible individuals is discarded with them
and not used during the optimization.
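
A death penalty amounts to a feasibility filter. The sketch below assumes constraints of the form $g_i(x) \geq 0$ and $h_i(x) = 0$ as introduced at the start of this section; the tolerance `tol` for equality constraints is an assumption added here because exact equality is rarely meaningful with floating point numbers, and the example constraints are hypothetical.

```python
def is_feasible(x, inequalities, equalities=(), tol=1e-9):
    """True if every g(x) >= 0 and every |h(x)| <= tol."""
    return (all(g(x) >= 0 for g in inequalities)
            and all(abs(h(x)) <= tol for h in equalities))


def death_penalty(candidates, inequalities, equalities=()):
    """Discard all infeasible candidates; no information about them is kept."""
    return [x for x in candidates if is_feasible(x, inequalities, equalities)]


# Hypothetical constraints describing the interval 1 <= x <= 3.
constraints = [lambda x: x - 1, lambda x: 3 - x]
survivors = death_penalty([0, 1, 2, 3, 4], constraints)
print(survivors)
```

If the feasible region is small, almost every candidate is filtered out, which is exactly the stagnation risk mentioned above.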

Penalty Functions 

Maybe one of the most popular approaches for dealing with constraints, especially in the
area of single-objective optimization, goes back to Courant [458] who introduced the idea
of penalty functions in 1943. Here, the constraints are combined with the objective function
$f$, resulting in a new function $f'$ which is then actually optimized. The basic idea is that
this combination is done in a way which ensures that an infeasible solution candidate always
has a worse $f'$-value than a feasible one with the same objective values. In [458], this is
achieved by defining $f'$ as $f'(x) = f(x) + v [h(x)]^2$. Various similar approaches exist. Carroll
[345, 346], for instance, chose a penalty function of the form $f'(x) = f(x) + v \sum_{i=1}^{p} [g_i(x)]^{-1}$
which ensures that the function $g$ does not become zero or negative.
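
Both penalty forms mentioned above can be sketched as higher-order functions. This is an illustration under assumptions: the objective $f$ is taken to be minimized, `v` is a user-chosen penalty factor, and the example objective and constraints are hypothetical.

```python
def quadratic_penalty(f, h, v):
    """f'(x) = f(x) + v * h(x)**2 for an equality constraint h(x) = 0."""
    return lambda x: f(x) + v * h(x) ** 2


def inverse_barrier(f, gs, v):
    """f'(x) = f(x) + v * sum(1 / g_i(x)); only meaningful where all g_i(x) > 0."""
    def f_prime(x):
        return f(x) + v * sum(1.0 / g(x) for g in gs)
    return f_prime


# Hypothetical problem: minimize x**2 subject to x - 1 = 0.
fp = quadratic_penalty(lambda x: x ** 2, lambda x: x - 1, v=10.0)
print(fp(0.0), fp(1.0))  # the infeasible x = 0 is penalized

# Hypothetical problem: minimize x subject to x > 0.
fb = inverse_barrier(lambda x: x, [lambda x: x], v=1.0)
print(fb(2.0), fb(0.01))  # the value grows as x approaches the boundary
```

The quadratic form penalizes violations of the equality constraint in proportion to their squared magnitude, whereas the inverse form pushes the search away from the boundary of the feasible region from the inside.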

There are practically no limits for the ways in which a penalty for infeasibility can be 
integrated into the objective functions. Several researchers suggest dynamic penalties which 
incorporate the index of the current iteration of the optimizer [1063, 1560] or adaptive 
penalties which additionally utilize population statistics [1876, 1877, 875, 159]. Rigorous 
discussions on penalty functions have been contributed by Fiacco and McCormick [665] and 
Smith and Coit [1901]. 

Constraints as Additional Objectives 

Another idea for handling constraints would be to consider them as new objective functions.
If $g(x) \geq 0$ must hold, for instance, we can transform this to a new objective function
$f^*(x) = -\min\{g(x), 0\}$ subject to minimization. The minimum is needed since there is no
use in maximizing $g$ further than 0 and hence, after it has reached 0, the optimization pressure
must be removed. An approach similar to this is Deb's Goal Programming method [536, 533].
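
A minimal sketch of this transformation is given below. It is written as `max(-g(x), 0)`, which equals $-\min\{g(x), 0\}$, so that the new objective is positive exactly when the constraint $g(x) \geq 0$ is violated and zero otherwise; the function name and the example constraint are assumptions made for illustration.

```python
def constraint_objective(g):
    """Turn the constraint g(x) >= 0 into an objective to be minimized.

    The result is the amount of violation: 0 as soon as g(x) >= 0 holds.
    """
    return lambda x: max(-g(x), 0.0)


# Hypothetical constraint x - 2 >= 0.
f_star = constraint_objective(lambda x: x - 2.0)
print(f_star(0.0), f_star(2.0), f_star(5.0))
```

Because the value stays at zero for all feasible candidates, the additional objective exerts no pressure once the constraint is met.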

The Method of Inequalities 

General inequality constraints can also be processed according to the Method of Inequalities
(MOI) introduced by Zakian [2304, 2305, 2306, 2307, 2308] in his seminal work on computer-
aided control systems design (CACSD) [1814, 2200, 2315]. In the MOI, an area of interest
is specified in form of a goal range $[\check{r}_i, \hat{r}_i]$ for each objective function $f_i$.

Pohlheim [1651] outlines how this approach can be combined with Pareto optimization:
Based on the inequalities, three categories of solution candidates can be defined and each
element $x \in X$ belongs to one of them:

1. It fulfills all of the goals, i. e.,

$$\check{r}_i \leq f_i(x) \leq \hat{r}_i \;\; \forall i \in [1, |F|] \tag{1.9}$$

2. It fulfills some (but not all) of the goals, i. e.,

$$(\exists i \in [1, |F|] : \check{r}_i \leq f_i(x) \leq \hat{r}_i) \wedge (\exists j \in [1, |F|] : (f_j(x) < \check{r}_j) \vee (f_j(x) > \hat{r}_j)) \tag{1.10}$$

3. It fulfills none of the goals, i. e.,

$$(f_i(x) < \check{r}_i) \vee (f_i(x) > \hat{r}_i) \;\; \forall i \in [1, |F|] \tag{1.11}$$

Using these groups, a new comparison mechanism is created:

1. The solution candidates that fulfill all goals are preferred over all other individuals
that either fulfill some or no goals.

2. The solution candidates that are not able to fulfill any of the goals succumb to those
which fulfill at least some goals.

3. Only the solutions that are in the same group are compared on the basis of the Pareto
domination relation.

By doing so, the optimization process will be driven into the direction of the interesting 
part of the Pareto frontier. Less effort will be spent in creating and evaluating individuals 
in parts of the problem space that most probably do not contain any valid solution. 
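
The three classes and the resulting comparison mechanism can be sketched as follows. The functions operate on tuples of objective values, the goal ranges are given as `(lower, upper)` pairs, and all names as well as the sample values are assumptions introduced for illustration.

```python
def dominates(y1, y2, omega):
    """True if objective vector y1 dominates y2 (omega: 1 = minimize, -1 = maximize)."""
    not_worse = all(w * a <= w * b for a, b, w in zip(y1, y2, omega))
    better = any(w * a < w * b for a, b, w in zip(y1, y2, omega))
    return not_worse and better


def moi_class(y, goals):
    """1: all goals fulfilled, 2: some but not all, 3: none."""
    met = [low <= value <= high for value, (low, high) in zip(y, goals)]
    if all(met):
        return 1
    return 2 if any(met) else 3


def moi_compare(y1, y2, goals, omega):
    """Negative if y1 is preferred, positive if y2 is preferred, 0 otherwise."""
    c1, c2 = moi_class(y1, goals), moi_class(y2, goals)
    if c1 != c2:
        return -1 if c1 < c2 else 1  # the lower class always wins
    if dominates(y1, y2, omega):     # same class: fall back to Pareto domination
        return -1
    if dominates(y2, y1, omega):
        return 1
    return 0


goals = [(0.0, 10.0), (0.0, 10.0)]  # hypothetical goal ranges
omega = (1, 1)                      # both objectives minimized
print(moi_compare((5, 5), (1, 20), goals, omega))
```

In the example call, the first vector wins although it is not better in the first objective, because it fulfills both goals while the second violates one of them.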

Graphical Example 1 

In Figure 1.10, we apply the Pareto-based Method of Inequalities to our first graphical
example. We impose the same goal ranges on both objectives, $\check{r}_1 = \check{r}_2$ and $\hat{r}_1 = \hat{r}_2$. By
doing so, the second non-dominated region from the Pareto example Figure 1.8 suddenly
becomes infeasible, since $f_1$ rises over $\hat{r}_1$ there. Also, the greater part of the first optimal
area from this example is infeasible because $f_2$ drops under $\check{r}_2$. In the whole domain $X$ of
the optimization problem, only the regions $[x_1, x_2]$ and $[x_3, x_4]$ fulfill all the target criteria.
To these elements, Pareto comparisons are applied. It turns out that the elements in $[x_3, x_4]$
dominate all the elements in $[x_1, x_2]$ since they provide higher values in $f_1$ for same values in
$f_2$. If we scan through $[x_3, x_4]$ from left to right, we can see that $f_1$ rises while $f_2$ degenerates,
which is why the elements in this area cannot dominate each other and, hence, are all
optimal.




Graphical Example 2 

In Figure 1.11 we apply the Pareto-based Method of Inequalities to our second graphical
example from Section 1.2.2. We apply two different ranges of interest $[\check{r}_3, \hat{r}_3]$ and $[\check{r}_4, \hat{r}_4]$ on
$f_3$ and $f_4$, as sketched in Fig. 1.11.a.

Fig. 1.11.a: The ranges applied to $f_3$ and $f_4$.

Fig. 1.11.b: The Pareto-based Method of Inequalities class division.

Fig. 1.11.c: The Pareto-based Method of Inequalities ranking.

Figure 1.11: Optimization using the Pareto-based Method of Inequalities approach (second
example).



Like we did in the second example for Pareto optimization, we want to plot the quality of
the elements in the problem space. Therefore, we first assign a number $c \in \{1, 2, 3\}$ to each
of its elements in Fig. 1.11.b. This number corresponds to the classes to which the elements
belong, i. e., 1 means that a solution candidate fulfills all inequalities, for an element of class
2, at least some of the constraints hold, and the elements in class 3 fail all requirements.
Based on this class division, we can then perform a modified Pareto counting where each
element dominates all the elements in higher classes (Fig. 1.11.c). The result is that multiple
single optima $x^*_1$, $x^*_2$, $x^*_3$, etc., and even a set of adjacent, non-dominated elements $X^*_9$ occur.
These elements are, again, situated at the bottom of the illustrated landscape whereas the
worst solution candidates reside on hill tops.

A good overview on techniques for the Method of Inequalities is given by Whidborne 
et al. [2200]. 

Limitations and Other Methods 

Other approaches for incorporating constraints into optimization are Goal Attainment [2233,
714] and Goal Programming¹⁹ [377, 376]. Especially interesting in our context are methods



19 http://en.wikipedia.org/wiki/Goal_programming [accessed 2007-07-03]

which have been integrated into evolutionary algorithms [2002, 536, 533, 1804, 1651], such
as the popular Goal Attainment approach by Fonseca and Fleming [714] which is similar to
the Pareto-MOI we have adopted from Pohlheim [1651]. Again, an overview on this subject
is given by Coello Coello et al. in [361].



1.2.4 Unifying Approaches

External Decision Maker

All approaches for defining what optima are and how constraints should be considered are
rather specific and bound to certain mathematical constructs. The more general concept of
an External Decision Maker which (or who) decides which solution candidates prevail has
been introduced by Fonseca and Fleming [715, 716]. One of the ideas behind "externalizing"
the assessment process on what is good and what is bad is that Pareto optimization imposes
only a partial order²⁰ on the solution candidates. In a partial order, elements may exist
which neither succeed nor precede each other. As we have seen in Section 1.2.2, there can,
for instance, be two individuals $x_1, x_2 \in X$ with neither $x_1 \vdash x_2$ nor $x_2 \vdash x_1$. A special
case of this situation is the non-dominated set, the so-called Pareto frontier which we try
to estimate with the optimization process.

Most fitness assignment processes, however, require some sort of total order²¹, where each
individual is either better or worse than each other (except for the case of identical solution 
candidates which are, of course, equal to each other). The fitness assignment algorithms can 
create such a total order by themselves. One example for doing this is the Pareto ranking 
which we will discuss later in Section 2.3.3 on page 112, where the number of individuals 
dominating a solution candidate denotes its fitness. 

While this method of ordering is a good default approach capable of directing the search
toward the Pareto frontier and delivering a broad scan of it, it neglects the fact
that the user of the optimization most often is not interested in the whole optimal set but
has preferences, certain regions of interest [717]. Such a region will then exclude the infeasible
(but Pareto optimal) programs for the Artificial Ant as discussed in Section 1.2.2. What the
user wants is a detailed scan of these areas, which often cannot be delivered by pure Pareto
optimization.

Figure 1.12: An external decision maker providing an evolutionary algorithm with utility
values.



Here the External Decision Maker comes into play as an expression of the user's preferences
[712], as illustrated in Figure 1.12. The task of this decision maker is to provide a cost
function $u : Y \mapsto \mathbb{R}$ (or utility function, if the underlying optimizer is maximizing) which
maps the space of objective values $Y$ (which is usually $\mathbb{R}^n$) to the space of real numbers

20 A definition of partial order relations is specified in Definition 27.31 on page 463. 

21 The concept of total orders is elucidated in Definition 27.32 on page 464. 

$\mathbb{R}$. Since there is a total order defined on the real numbers, this process is another way
of resolving the "incomparability-situation". The structure of the decision making process
$u$ can freely be defined and may incorporate any of the previously mentioned methods.
$u$ could, for example, be reduced to compute a weighted sum of the objective values, to
perform an implicit Pareto ranking, or to compare individuals based on pre-specified goal-
vectors. Furthermore, it may even incorporate forms of artificial intelligence, other forms of
multi-criterion Decision Making, and even interaction with the user. This technique allows
focusing the search onto solutions which are not only optimal in the Pareto sense, but also
feasible and interesting from the viewpoint of the user.

Fonseca and Fleming make a clear distinction between fitness and cost values. Cost values
have some meaning outside the optimization process and are based on user preferences. 
Fitness values on the other hand are an internal construct of the search with no meaning 
outside the optimizer (see Definition 1.35 on page 46 for more details). If External Decision 
Makers are applied in evolutionary algorithms or other search paradigms that are based on 
fitness measures, these will be computed using the values of the cost function instead of the 
objective functions [718, 712, 713]. 

Prevalence Optimization 

We have now discussed various approaches which define optima in terms of multi-objective
optimization and steer the search process into their direction. Let us subsume all of them in
a general approach. From the concept of Pareto optimization to the Method of Inequalities,
the need to compare elements of the problem space in terms of their quality as solution
for a given problem winds like a red thread through this matter. Even the weighted sum
approach and the External Decision Maker do nothing else than mapping multi-dimensional
vectors to the real numbers in order to make them comparable.

If we compare two solution candidates $x_1$ and $x_2$, either $x_1$ is better than $x_2$, vice versa,
or both are of equal quality. Hence, there are three possible relations between two elements
of the problem space. These three outcomes can be expressed with a comparator function $cmp_F$.

Definition 1.15 (Comparator Function). A comparator function $cmp : A^2 \mapsto \mathbb{R}$ maps
all pairs of elements $(a_1, a_2) \in A^2$ to the real numbers $\mathbb{R}$ according to two complementing
partial orders²² $R_1$ and $R_2$:

$$R_1(a_1, a_2) \Leftrightarrow cmp(a_1, a_2) < 0 \;\; \forall a_1, a_2 \in A \tag{1.12}$$
$$R_2(a_1, a_2) \Leftrightarrow cmp(a_1, a_2) > 0 \;\; \forall a_1, a_2 \in A \tag{1.13}$$
$$\neg R_1(a_1, a_2) \wedge \neg R_2(a_1, a_2) \Leftrightarrow cmp(a_1, a_2) = 0 \;\; \forall a_1, a_2 \in A \tag{1.14}$$
$$cmp(a, a) = 0 \;\; \forall a \in A \tag{1.15}$$

$R_1$ (and hence, $cmp(a_1, a_2) < 0$) is equivalent to the precedence relation and $R_2$ denotes
succession.

From the three defining equations, many features of $cmp$ can be deduced. It is, for
instance, transitive, i. e., $cmp(a_1, a_2) < 0 \wedge cmp(a_2, a_3) < 0 \Rightarrow cmp(a_1, a_3) < 0$. Provided
with the knowledge of the objective functions $f \in F$, such a comparator function $cmp_F$ can
be imposed on the problem spaces of our optimization problems:

Definition 1.16 (Prevalence Comparator Function). A prevalence comparator function
$cmp_F : X^2 \mapsto \mathbb{R}$ maps all pairs $(x_1, x_2) \in X^2$ of solution candidates to the real numbers
$\mathbb{R}$ according to Definition 1.15.

The subscript $F$ in $cmp_F$ illustrates that the comparator has access to all the values of
the objective functions in addition to the problem space elements which are its parameters.
As shortcut for this comparator function, we introduce the prevalence notation as follows:

22 Partial orders are introduced in Definition 27.30 on page 463.

Definition 1.17 (Prevalence). An element $x_1$ prevails over an element $x_2$ ($x_1 \succ x_2$) if
the application-dependent prevalence comparator function $cmp_F(x_1, x_2) \in \mathbb{R}$ returns a value
less than 0.

$$(x_1 \succ x_2) \Leftrightarrow cmp_F(x_1, x_2) < 0 \;\; \forall x_1, x_2 \in X \tag{1.16}$$
$$(x_1 \succ x_2) \wedge (x_2 \succ x_3) \Rightarrow x_1 \succ x_3 \;\; \forall x_1, x_2, x_3 \in X \tag{1.17}$$

It is easy to see that we can define Pareto domination relations and Method of
Inequalities-based comparisons, as well as the weighted sum combination of objective values
based on this notation. Together with the fitness assignment strategies which will be
introduced later in this book (see Section 2.3 on page 111), it covers many of the most
sophisticated multi-objective techniques that are proposed, for instance, in [715, 1128, 2002].
By replacing the Pareto approach with prevalence comparisons, all the optimization algorithms
(especially many of the evolutionary techniques) relying on domination relations can
be used in their original form while offering the new ability of scanning special regions of
interest of the optimal frontier.

Since the comparator function $cmp_F$ and the prevalence relation impose a partial order
on the problem space $X$ like the domination relation does, we can construct the optimal set
in a way very similar to Equation 1.8:

$$x^* \in X^* \Leftrightarrow \nexists x \in X : x \neq x^* \wedge x \succ x^* \tag{1.18}$$

For illustration purposes, we will exercise the prevalence approach on the examples of the
weighted sum method²³ $cmp_{F,weightedS}$ with the weights $w_i$ as well as on the domination-
based Pareto optimization²⁴ $cmp_{F,Pareto}$ with the objective directions $\omega_i$:

$$cmp_{F,weightedS}(x_1, x_2) = \sum_{i=1}^{|F|} (w_i f_i(x_2) - w_i f_i(x_1)) = g(x_2) - g(x_1) \tag{1.19}$$
$$cmp_{F,Pareto}(x_1, x_2) = \begin{cases} -1 & \text{if } x_1 \vdash x_2 \\ 1 & \text{if } x_2 \vdash x_1 \\ 0 & \text{otherwise} \end{cases} \tag{1.20}$$
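
The two comparators and the construction of the optimal set from Equation 1.18 can be sketched as follows. As a simplification assumed here, the functions work directly on tuples of objective values instead of on solution candidates, and the sample vectors are hypothetical.

```python
def dominates(y1, y2, omega):
    """True if objective vector y1 dominates y2 (omega: 1 = minimize, -1 = maximize)."""
    not_worse = all(w * a <= w * b for a, b, w in zip(y1, y2, omega))
    better = any(w * a < w * b for a, b, w in zip(y1, y2, omega))
    return not_worse and better


def cmp_weighted_sum(y1, y2, weights):
    """Equation 1.19: g(x2) - g(x1), negative if x1 has the larger weighted sum."""
    return sum(w * b - w * a for a, b, w in zip(y1, y2, weights))


def cmp_pareto(y1, y2, omega):
    """Equation 1.20: -1 if y1 dominates y2, 1 if y2 dominates y1, 0 otherwise."""
    if dominates(y1, y2, omega):
        return -1
    if dominates(y2, y1, omega):
        return 1
    return 0


def optimal_set(ys, cmp):
    """Equation 1.18: keep the elements over which no other element prevails."""
    return [y for y in ys if not any(cmp(z, y) < 0 for z in ys if z != y)]


ys = [(1, 5), (2, 2), (5, 1), (3, 3)]  # hypothetical values, both maximized
by_pareto = optimal_set(ys, lambda a, b: cmp_pareto(a, b, (-1, -1)))
by_sum = optimal_set(ys, lambda a, b: cmp_weighted_sum(a, b, (2.0, 1.0)))
print(by_pareto, by_sum)
```

The same `optimal_set` routine works with either comparator, which is the point of the prevalence notation: the partial order of the Pareto comparator leaves several incomparable elements, while the weighted sum with the chosen weights singles out one of them.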

Artificial Ant Example 

With the prevalence comparator, we can also easily solve the problem stated in Section 1.2.2
by no longer encouraging the evolution of useless programs for Artificial Ants while retaining
the benefits of Pareto optimization. The comparator function can simply be defined in a
way that useless programs will always be prevailed by useful ones. It therefore may incorporate the
knowledge on the importance of the objective functions. Let $f_1$ be the objective function
with an output proportional to the food piled, $f_2$ would denote the distance covered in
order to find the food, and $f_3$ would be the program length. Equation 1.21 demonstrates
one possible comparator function for the Artificial Ant problem.



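$$cmp_{F,ant}(x_1, x_2) = \begin{cases}
-1 & \text{if } (f_1(x_1) > 0 \wedge f_1(x_2) = 0) \vee (f_2(x_1) > 0 \wedge f_2(x_2) = 0) \vee (f_3(x_1) > 0 \wedge f_1(x_2) = 0) \\
1 & \text{if } (f_1(x_2) > 0 \wedge f_1(x_1) = 0) \vee (f_2(x_2) > 0 \wedge f_2(x_1) = 0) \vee (f_3(x_2) > 0 \wedge f_1(x_1) = 0) \\
cmp_{F,Pareto}(x_1, x_2) & \text{otherwise}
\end{cases} \tag{1.21}$$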
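
Equation 1.21 can be transliterated as shown below. The sketch assumes that each program has already been evaluated to a tuple (food, distance, length), with food maximized and the other two objectives minimized; the three sample tuples are taken from the list of example programs in the discussion of pure Pareto optimization, and the function names are assumptions.

```python
OMEGA = (-1, 1, 1)  # maximize food, minimize distance and program length


def dominates(y1, y2, omega):
    """True if objective vector y1 dominates y2 (omega: 1 = minimize, -1 = maximize)."""
    not_worse = all(w * a <= w * b for a, b, w in zip(y1, y2, omega))
    better = any(w * a < w * b for a, b, w in zip(y1, y2, omega))
    return not_worse and better


def cmp_pareto(y1, y2, omega):
    if dominates(y1, y2, omega):
        return -1
    return 1 if dominates(y2, y1, omega) else 0


def cmp_ant(y1, y2):
    """Prefer programs that do something over programs that do nothing."""
    food1, dist1, len1 = y1
    food2, dist2, len2 = y2
    if ((food1 > 0 and food2 == 0) or (dist1 > 0 and dist2 == 0)
            or (len1 > 0 and food2 == 0)):
        return -1
    if ((food2 > 0 and food1 == 0) or (dist2 > 0 and dist1 == 0)
            or (len2 > 0 and food1 == 0)):
        return 1
    return cmp_pareto(y1, y2, OMEGA)  # otherwise: plain Pareto comparison


useful = (50, 500, 100)     # 50 food items, 500 length units, 100 instructions
greedy = (60, 5000, 100)    # 60 food items, 5000 length units, 100 instructions
empty = (0, 0, 0)           # the program that does nothing
print(cmp_pareto(useful, empty, OMEGA), cmp_ant(useful, empty))
```

Under plain Pareto comparison the empty program is incomparable to the useful one, whereas the ant comparator lets the useful program prevail; two useful programs are still compared by domination.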



23 See Equation 1.4 on page 29 for more information on weighted sum optimization.

24 Pareto optimization was defined in Equation 1.6 on page 31.

Later in this book, we will discuss some of the most popular optimization strategies. Al-
though they are usually implemented based on Pareto optimization, we will always introduce
them using prevalence.

1.3 The Structure of Optimization 

After we have discussed what optima are and have seen a crude classification of global
optimization algorithms, let us now take a look at the general structure common to all
optimization processes. This structure consists of a number of well-defined spaces and sets
as well as the mappings between them. Based on this structure of optimization, we will
introduce the abstractions fitness landscape, problem landscape, and optimization problem
which will lead us to a more thorough definition of what optimization is.

1.3.1 Spaces, Sets, and Elements 

In this section, we elaborate on the relation between the (possibly different) representations 
of solution candidates for search and for evaluation. We will show how these representations 
are connected and introduce fitness as a relative utility measure defined on sets of solution
candidates. You will find that the general model introduced here applies to all the global 
optimization methods mentioned in this book, often in a simplified manner. One example for 
this structure of optimization processes is given in Figure 1.13 by using a genetic algorithm 
which encodes the coordinates of points in a plane into bit strings as an illustration. 

The Problem Space and the Solutions therein 

Whenever we tackle an optimization problem, we first have to define the type of the pos- 
sible solutions. For deriving a controller for the Artificial Ant problem, we could choose 
programs or artificial neural networks as solution representation. If we are to find the root 
of a mathematical function, we would go for real numbers $\mathbb{R}$ as solution candidates and when
configuring or customizing a car for a sales offer, all possible solutions are elements of the 
power set of all optional features. With this initial restriction to a certain type of results, 
we have specified the problem space X. 

Definition 1.18 (Problem Space). The problem space X (phenome) of an optimization 
problem is the set containing all elements x which could be its solution. 

Usually, more than one problem space can be defined for a given optimization problem.
A few lines before, we said that as problem space for finding the root of a mathematical
function, the real numbers $\mathbb{R}$ would be fine. On the other hand, we could as well restrict
ourselves to the natural numbers $\mathbb{N}$ or widen the search to the whole complex plane $\mathbb{C}$. This
choice has major impact: On one hand, it determines which solutions we can possibly find.
On the other hand, it also has subtle influence on the search operations. Between each two
different points in $\mathbb{R}$, for instance, there are infinitely many other numbers, while in $\mathbb{N}$, there
are not.

In analogy to genetic algorithms, we often refer to the problem space synonymously as the
phenome. The problem space $X$ is often restricted by

1. logical constraints that rule out elements which cannot be solutions, like programs of 
zero length when trying to solve the Artificial Ant problem and 

2. practical constraints that prevent us, for instance, from taking all real numbers into 
consideration in the minimization process of a real function. On our off-the-shelf CPUs 
or with the Java programming language, we can only use 64 bit floating point numbers.
With these 64 bits, it is only possible to express numbers up to a certain precision and
we cannot have more than 15 or so significant decimal digits.
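
The practical limit named in the second item can be inspected directly. The sketch assumes that the Python interpreter represents `float` as a 64 bit IEEE 754 number, which holds on common platforms, and uses the standard `sys.float_info` structure.

```python
import sys

# Number of decimal digits that a 64 bit float can always represent faithfully.
digits = sys.float_info.dig

# Distance between 1.0 and the next larger representable number.
eps = sys.float_info.epsilon

print(digits, eps)
print(1.0 + eps / 2 == 1.0)  # a step smaller than eps is lost near 1.0
```

For an optimizer working on real vectors this means that the problem space is in fact a large but finite set, and that two mathematically distinct candidates may be indistinguishable on the machine.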

Figure 1.13: Spaces, Sets, and Elements involved in an optimization process.

Definition 1.19 (Solution Candidate). A solution candidate $x$ is an element of the
problem space $X$ of a certain optimization problem.

In the context of evolutionary algorithms, solution candidates are usually called pheno- 
types. In this book, we will use both terms synonymously. Somewhere inside the problem 
space, the solutions of the optimization problem will be located (if the problem can actually 
be solved, that is). 

Definition 1.20 (Solution Space). We call the union of all solutions of an optimization
problem its solution space $\mathbb{S}$.

$$X^* \subseteq \mathbb{S} \subseteq X \tag{1.22}$$

This solution space contains (and can be equal to) the global optimal set $X^*$. There may
exist valid solutions $x \in \mathbb{S}$ which are not elements of $X^*$, especially in the context of
constraint optimization (see Section 1.2.3).

The Search Space 

Definition 1.21 (Search Space). The search space G of an optimization problem is the 
set of all elements g which can be processed by the search operations. 

As previously mentioned, the type of the solution candidates depends on the problem 
to be solved. Since there are many different applications for optimization, there are many
different forms of problem spaces. It would be cumbersome to develop search operations 
time and again for each new problem space we encounter. Such an approach would not only 
be error-prone, it would also make it very hard to formulate general laws and to consolidate 
findings. Instead, we often reuse well-known search spaces for many different problems. Then, 
only a mapping between search and problem space has to be defined (see page 44). Although 
this is not always possible, it allows us to use more out-of-the-box software in many cases. 

Borrowing terminology from genetic algorithms, we often refer to the search space synonymously
as the genome 25, a term coined by the German biologist Winkler [2241] as a portmanteau of the
words gene 26 and chromosome [1267]. The genome is the whole hereditary information of
organisms. This includes both the genes and the non-coding sequences of the deoxyribonucleic
acid (DNA 27), which is illustrated in Figure 1.14. Simply put, the DNA is a string of
base pairs that encodes the phenotypical characteristics of the creature it belongs to.

Figure 1.14: A sketch of a part of a DNA molecule.

25 http://en.wikipedia.org/wiki/Genome [accessed 2007-07-15] 

26 The words gene, genotype, and phenotype have, in turn, been introduced by the Danish biologist 
Johannsen [1056]. [2240] 

27 http://en.wikipedia.org/wiki/Dna [accessed 2007-07-03] 

Definition 1.22 (Genotype). The elements $g \in \mathbb{G}$ of the search space G of a given
optimization problem are called the genotypes.

The elements of the search space rarely are unstructured aggregations. Instead, they often 
consist of distinguishable parts, hierarchical units, or well-typed data structures. The same 
goes for the DNA in biology. It consists of genes, segments of nucleic acid, that contain 
the information necessary to produce RNA strings in a controlled manner 28 . A fish, for 
instance, may have a gene for the color of its scales. This gene, in turn, could have two 
possible "values" called alleles 29 , determining whether the scales will be brown or gray. 
The genetic algorithm community has adopted this notation long ago and we can use it for 
arbitrary search spaces. 

Definition 1.23 (Gene). The distinguishable units of information in a genotype that 
encode the phenotypical properties are called genes. 

Definition 1.24 (Allele). An allele is a value of a specific gene.

Definition 1.25 (Locus). The locus 30 is the position where a specific gene can be found
in a genotype.

Figure 1.15 on page 45 refines the relations of genotypes and phenotypes from the
initial example for the spaces in Figure 1.13 by also marking genes, alleles, and loci. In the
car customizing problem also mentioned earlier, the first gene could identify the color of
the automobile. Its locus would then be 0 and it could have the alleles 00, 01, 10, and 11,
encoding for red, white, green, and blue, for instance. The second gene (at locus 1) with the
alleles 0 or 1 may define whether or not the car comes with climate control, and so on.
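
The car example can be sketched in a few lines of Python. The gene table below (names, bit widths, and the dictionary layout) is a hypothetical encoding made up for illustration; only the colors and the climate control flag come from the text.

```python
# Hypothetical genome layout for the car example: each gene is described by
# (name, locus, number of bits, meaning of each allele).
GENES = [
    ("color", 0, 2, {"00": "red", "01": "white", "10": "green", "11": "blue"}),
    ("climate_control", 1, 1, {"0": False, "1": True}),
]

def decode(genotype: str) -> dict:
    """Read the allele of every gene from a bit string and translate it."""
    phenotype = {}
    position = 0  # index of the next unread bit
    for name, _locus, width, alleles in sorted(GENES, key=lambda gene: gene[1]):
        allele = genotype[position:position + width]
        phenotype[name] = alleles[allele]
        position += width
    return phenotype

print(decode("101"))  # green car with climate control
```

Here the locus only fixes the order of the genes, while the bit width decides how many alleles a gene can have.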

The Search Operations 

In some problems, the search space G may be identical to the problem space X. If we go back
to our previous examples, for instance, we will find that there exist a lot of optimization
strategies that work directly on vectors of real numbers. When minimizing a real function, we
could use such an approach (Evolution Strategies, for instance, see Chapter 5 on page 227)
and set $\mathbb{G} = \mathbb{X} = \mathbb{R}$. Also, the configurations of cars may be represented as bit strings:
Assume that such a configuration consists of k features, which can either be included in or
excluded from an offer to the customer. We can then search in the space of binary strings of
this length, $\mathbb{G} = \mathbb{B}^k = \{\mathrm{true}, \mathrm{false}\}^k$, which is exactly what genetic algorithms (discussed
in Section 3.1 on page 141) do. By using their optimization capabilities, we do not need
to mess with the search and selection techniques but can rely on well-researched standard
operations.

Definition 1.26 (Search Operations). The search operations searchOp are used by op- 
timization algorithms in order to explore the search space G. 

We subsume all search operations which are applied by an optimization algorithm in 
order to solve a given problem in the set Op. Search operations can be defined with different 
arities 31 . Equation 1.23, for instance, denotes an n-ary operator, i.e., one with n arguments. 
The result of a search operation is one element of the search space. 



$$\mathrm{searchOp} : \mathbb{G}^n \mapsto \mathbb{G} \tag{1.23}$$

28 http://en.wikipedia.org/wiki/Gene [accessed 2007-07-03]

29 http://en.wikipedia.org/wiki/Allele [accessed 2007-07-03]

30 http://en.wikipedia.org/wiki/Locus_%28genetics%29 [accessed 2007-07-03]

31 http://en.wikipedia.org/wiki/Arity [accessed 2008-02-15]

Mutation and crossover in genetic algorithms (see Chapter 3) are examples of unary
and binary search operations, whereas Differential Evolution utilizes a ternary operator (see 
Section 5.5). Optimization processes are often initialized by creating random genotypes - 
usually the results of a search operation with zero arity (no parameters). 
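
The different arities can be illustrated on bit-string genotypes. The sketch below is an assumption-laden illustration: the function names, the genotype length of 8, and the specific operators (one-bit flip, single-point crossover) are chosen here and are not prescribed by the text.

```python
import random

def create(rng: random.Random, n: int = 8) -> list:
    """Nullary operation: no genotype goes in, a random bit string comes out."""
    return [rng.randint(0, 1) for _ in range(n)]

def mutate(rng: random.Random, g: list) -> list:
    """Unary operation: copy g and flip one randomly chosen bit."""
    child = list(g)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def crossover(rng: random.Random, g1: list, g2: list) -> list:
    """Binary operation: single-point crossover of two parents."""
    cut = rng.randrange(1, len(g1))
    return g1[:cut] + g2[cut:]

rng = random.Random(1)
parent_a, parent_b = create(rng), create(rng)
child = mutate(rng, crossover(rng, parent_a, parent_b))
print(parent_a, parent_b, child)
```

Each operation returns exactly one element of the search space, in line with Equation 1.23.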

Search operations often involve random numbers. In such cases, it makes no sense to
reason about their results with statements like $\exists g_1, g_2 \in \mathbb{G} : g_2 = \mathrm{searchOp}(g_1) \land \dots$
Instead, we need to work with probabilities like
$\exists g_1, g_2 \in \mathbb{G} : P(g_2 = \mathrm{searchOp}(g_1)) > 0 \land \dots$ Based on Definition 1.26,
we will use the notation $Op(x)$ for the application of any of the operations $\mathrm{searchOp} \in Op$
to the genotype x. With $Op^k(x)$ we denote k successive applications of (possibly different)
search operators. If the parameter x is omitted, i. e., just $Op^k$ is written, this chain has to
start with a search operation with zero arity. In the style of Badea and Stanciu [111] and
Skubch [1897, 1898], we now can define:

Definition 1.27 (Completeness). A set Op of search operations searchOp is complete
if and only if every point $g_1$ in the search space G can be reached from every other point
$g_2 \in \mathbb{G}$ by applying only operations $\mathrm{searchOp} \in Op$.

$$\forall g_1, g_2 \in \mathbb{G} \Rightarrow \exists k \in \mathbb{N} : P\left(g_1 = Op^k(g_2)\right) > 0 \tag{1.24}$$

Definition 1.28 (Weak Completeness). A set Op of search operations searchOp is weakly
complete if and only if every point g in the search space G can be reached by applying only
operations $\mathrm{searchOp} \in Op$. A weakly complete set of search operations hence includes at
least one parameterless function.

$$\forall g \in \mathbb{G} \Rightarrow \exists k \in \mathbb{N} : P\left(g = Op^k\right) > 0 \tag{1.25}$$

If the set of search operations is not complete, there are points in the search space which
cannot be reached. Then, we are probably not able to explore the problem space adequately
and possibly will not find a satisfyingly good solution.

Definition 1.29 (Adjacency (Search Space)). A point $g_2$ is adjacent to a point $g_1$ in
the search space G if it can be reached by applying a single search operation searchOp to
$g_1$. Notice that the adjacency relation is not necessarily symmetric.

$$\mathrm{adjacent}(g_2, g_1) = \begin{cases} \mathrm{true} & \text{if } \exists\, \mathrm{searchOp} \in Op : P(\mathrm{searchOp}(g_1) = g_2) > 0 \\ \mathrm{false} & \text{otherwise} \end{cases} \tag{1.26}$$
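
For a small, finite search space, adjacency and completeness can be checked by brute force. The sketch below assumes that each search operation is described by a function returning the set of genotypes it can produce with non-zero probability. This representation, the 3-bit search space, and the names are assumptions made for illustration.

```python
from collections import deque
from itertools import product

def flip_one_bit(g: tuple) -> set:
    """All genotypes a single-bit-flip mutation can produce from g with P > 0."""
    return {g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))}

def adjacent(g2: tuple, g1: tuple, operations) -> bool:
    """Equation 1.26 for operations given as 'set of possible results' functions."""
    return any(g2 in op(g1) for op in operations)

def reachable(start: tuple, operations) -> set:
    """Breadth-first search over the adjacency relation."""
    seen, queue = {start}, deque([start])
    while queue:
        g = queue.popleft()
        for op in operations:
            for neighbor in op(g) - seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def is_complete(space: set, operations) -> bool:
    """Definition 1.27: every point can be reached from every other point."""
    return all(reachable(g, operations) == space for g in space)

G = set(product((0, 1), repeat=3))
print(adjacent((0, 0, 1), (0, 0, 0), [flip_one_bit]))
print(is_complete(G, [flip_one_bit]))
```

An operator that always flips exactly two bits would fail this check, since it can never change the parity of the number of ones.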

The Connection between Search and Problem Space 

If the search space differs from the problem space, a translation between them is furthermore 
required. In our car example, we would need to transform the binary strings processed by 
the genetic algorithm to objects which represent the corresponding car configurations and 
can be processed by the objective functions. 

Definition 1.30 (Genotype-Phenotype Mapping). The genotype-phenotype mapping
(GPM, or ontogenic mapping [1619]) $gpm : \mathbb{G} \mapsto \mathbb{X}$ is a left-total 32 binary relation which
maps the elements of the search space G to elements in the problem space X.

$$\forall g \in \mathbb{G} \; \exists x \in \mathbb{X} : gpm(g) = x \tag{1.27}$$

The only hard criterion we impose on genotype-phenotype mappings in this book is
left-totality, i. e., that they map each element of the search space to at least one solution
candidate. They may be functional relations if they are deterministic. It is, however, possible
to create mappings which involve random numbers and, hence, cannot be considered to be
functions in the mathematical sense of Section 27.7.1 on page 462. Then, Equation 1.27
would need to be rewritten to Equation 1.28.

$$\forall g \in \mathbb{G} \; \exists x \in \mathbb{X} : P(gpm(g) = x) > 0 \tag{1.28}$$

32 See Equation 27.51 on page 461 to 5 on page 462 for an outline of the properties of binary
relations.

Figure 1.15: The relation of genome, genes, and the problem space

Genotype-phenotype mappings should further be surjective [1694], i. e., relate at least 
one genotype to each element of the problem space. Otherwise, some solution candidates 
can never be found and evaluated by the optimization algorithm and there is no guarantee 
whether the solution of a given problem can be discovered or not. If a genotype-phenotype 
mapping is injective, which means that it assigns distinct phenotypes to distinct elements 
of the search space, we say that it is free from redundancy. There are different forms of 
redundancy, some are considered to be harmful for the optimization process, others have 
positive influence 33. Most often, GPMs are not bijective (since they are neither necessarily
injective nor surjective). Nevertheless, if a genotype-phenotype mapping is bijective, we can
construct an inverse mapping $gpm^{-1} : \mathbb{X} \mapsto \mathbb{G}$.

$$gpm^{-1}(x) = g \Leftrightarrow gpm(g) = x \quad \forall x \in \mathbb{X}, g \in \mathbb{G} \tag{1.29}$$
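
Surjectivity and injectivity are easy to test when both spaces are small and finite. The mapping below (3-bit genotypes mapped to the number of included features) is a hypothetical example chosen because it is surjective but redundant. It is not taken from the text.

```python
from itertools import product

# Hypothetical example: genotypes are 3-bit strings, phenotypes are the number
# of included features (0 to 3). Several genotypes map to the same phenotype.
G = ["".join(bits) for bits in product("01", repeat=3)]
X = [0, 1, 2, 3]

def gpm(g: str) -> int:
    return g.count("1")

def is_surjective(gpm, G, X) -> bool:
    """Every phenotype is the image of at least one genotype."""
    return {gpm(g) for g in G} == set(X)

def is_injective(gpm, G) -> bool:
    """Distinct genotypes always map to distinct phenotypes (no redundancy)."""
    return len({gpm(g) for g in G}) == len(G)

print(is_surjective(gpm, G, X))  # all of X can be reached
print(is_injective(gpm, G))      # the mapping is redundant
```

Since this mapping is not injective, no inverse mapping in the sense of Equation 1.29 exists for it.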

Based on the genotype-phenotype mapping, we can also define an adjacency relation for 
the problem space, which, of course, is also not necessarily symmetric. 

Definition 1.31 (Adjacency (Problem Space)). A point $x_2$ is adjacent to a point $x_1$
in the problem space X if it can be reached by applying a single search operation searchOp
to their corresponding elements in the search space.

$$\mathrm{adjacent}(x_2, x_1) = \begin{cases} \mathrm{true} & \text{if } \exists g_1, g_2 : x_1 = gpm(g_1) \land x_2 = gpm(g_2) \land \mathrm{adjacent}(g_2, g_1) \\ \mathrm{false} & \text{otherwise} \end{cases} \tag{1.30}$$



By the way, we now have the means to define the term local optimum more clearly. The original
Definition 1.8 only applies to single objective functions, but with the use of the adjacency
relation adjacent, the prevalence criterion $\succ$, and the connection between the search space
and the problem space gpm, we can clarify it for multiple objectives.

Definition 1.32 (Local Optimum). A (local) optimum $x^\star_l \in \mathbb{X}$ of a set of objective
functions F is not worse than all points adjacent to it.

$$\forall x^\star_l \in \mathbb{X} \Rightarrow \left(\forall x \in \mathbb{X} : \mathrm{adjacent}(x, x^\star_l) \Rightarrow \neg (x \succ x^\star_l)\right) \tag{1.31}$$



33 See Section 1.4.5 on page 67 for more information.

The Objective Space and Optimization Problems

After the appropriate problem space has been defined, the search space has been selected, and
a translation between them (if needed) has been created, we are almost ready to feed the problem
to a global optimization algorithm. The main purpose of such an algorithm obviously is to
find as many elements as possible from the solution space: we are interested in the solution
candidates with the best possible evaluation results. This evaluation is performed by the
set F of n objective functions $f \in F$, each contributing one numerical value describing the
characteristics of a solution candidate x. 34

Definition 1.33 (Objective Space). The objective space Y is the space spanned by the 
codomains of the objective functions. 

$$F = \{ f_i : \mathbb{X} \mapsto Y_i : 0 < i \le n, Y_i \subseteq \mathbb{R} \} \Rightarrow \mathbb{Y} = Y_1 \times Y_2 \times \dots \times Y_n \tag{1.32}$$

The set F maps the elements x of the problem space X to the objective space Y and, 
by doing so, gives the optimizer information about their qualities as solutions for a given 
problem. 
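
As a small illustration of how a set F of objective functions maps a solution candidate to a point in the objective space, consider the following sketch. The two objectives (price and engine power of a car offer) and all names are hypothetical, chosen only to continue the car example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical problem space: car offers described by (price, power in kW).
Car = Tuple[float, float]

def f1(x: Car) -> float:
    """First objective: the price (to be minimized)."""
    return x[0]

def f2(x: Car) -> float:
    """Second objective: negated power, so that minimizing it prefers strong engines."""
    return -x[1]

F: Sequence[Callable[[Car], float]] = (f1, f2)

def evaluate(x: Car) -> Tuple[float, ...]:
    """Map a solution candidate to its point in the objective space Y."""
    return tuple(f(x) for f in F)

print(evaluate((20000.0, 90.0)))
```

With n = 2 objective functions, each evaluation yields one vector with two real components.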

Definition 1.34 (Optimization Problem). An optimization problem is defined by a five-tuple
(X, F, G, Op, gpm) specifying the problem space X, the objective functions F, the
search space G, the set of search operations Op, and the genotype-phenotype mapping gpm.
In theory, such an optimization problem can always be solved if Op is complete and
gpm is surjective.

Generic search and optimization algorithms find optimal elements if provided with an 
optimization problem defined in this way. Evolutionary algorithms, which we will discuss 
later in this book, are generic in this sense. Other optimization methods, like genetic
algorithms for example, may be more specialized and work with predefined search spaces and
search operations.

Fitness as a Relative Measure of Utility 

When performing a multi-objective optimization, i. e., $n = |F| > 1$, the elements of Y are
vectors in $\mathbb{R}^n$. In Section 1.2.2 on page 27, we have seen that such vectors cannot always
be compared directly in a consistent way and that we need some (comparative) measure for
what is "good". In many optimization techniques, especially in evolutionary algorithms, this
measure is used to map the objective space to a subset V of the positive real numbers $\mathbb{R}^+$.
For each solution candidate, this single real number represents its fitness as a solution for the
given optimization problem. The process of computing such a fitness value often depends not
solely on the absolute objective values of the solution candidates but also on those of
the other phenotypes known. It could, for instance, be the position of a solution candidate in the
list of investigated elements sorted according to the Pareto relation. Hence, fitness values
often only have a meaning inside the optimization process [712] and may change over time,
even if the objective values stay constant. In deterministic optimization methods, the value
of a heuristic function which approximates how many modifications we will have to apply
to the element in order to reach a feasible solution can be considered as the fitness.

Definition 1.35 (Fitness). The fitness 35 value $v(x) \in V$ of an element x of the problem
space X corresponds to its utility as solution or its priority in the subsequent steps of the
optimization process. The space spanned by all possible fitness values V is normally a subset
of the positive real numbers, $V \subseteq \mathbb{R}^+$.
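
That fitness can be a relative measure which changes while the objective values stay constant can be shown with a simple rank-based scheme. The sketch below is one possible scheme among many and not the book's fitness assignment process. It assumes minimization of all objectives and defines the fitness of a candidate as one plus the number of known candidates that Pareto-dominate it.

```python
def dominates(y1, y2) -> bool:
    """Pareto dominance for minimization: y1 is nowhere worse and somewhere better."""
    return all(a <= b for a, b in zip(y1, y2)) and any(a < b for a, b in zip(y1, y2))

def rank_fitness(objective_vectors):
    """Fitness of each vector: 1 + number of known vectors that dominate it."""
    return [
        1 + sum(dominates(other, y) for other in objective_vectors)
        for y in objective_vectors
    ]

known = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0)]
print(rank_fitness(known))                 # (3.0, 3.0) is dominated by (2.0, 2.0)
print(rank_fitness(known + [(0.5, 1.0)]))  # a new candidate changes the old fitness values
```

The fitness values are positive and smaller values are better, which matches the minimization convention of this book.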



34 See also Equation 1.3 on page 27.

35 http://en.wikipedia.org/wiki/Fitness_(genetic_algorithm) [accessed 2008-08-10]

The term fitness has been borrowed from biology 36 [1915, 1624] by the evolutionary
algorithms community. When the first applications of genetic algorithms were developed,
the focus was mainly on single-objective optimization. Back then, they called this single
function the fitness function and thus set objective value = fitness value. This point of view is
obsolete in principle, yet you will find many contemporary publications that use this notion.
This is partly due to the fact that in simple problems with only one objective function, the
old approach of using the objective values directly as fitness, i. e., $v(x) = f(x) \; \forall x \in \mathbb{X}$, can
sometimes actually be applied. In multi-objective optimization processes, this is not possible
and fitness assignment processes like those which we are going to elaborate on in Section 2.3
on page 111 are applied instead.

In the context of this book, fitness is subject to minimization, i. e., elements with smaller 
fitness are "better" than those with higher fitness. Although this definition differs from the 
biological perception of fitness, it complies with the idea that optimization algorithms are 
to find the minima of mathematical functions (if nothing else has been stated) . 

Further Definitions

In order to ease the discussions of different global optimization algorithms, we furthermore 
define the data structure individual. Especially evolutionary algorithms, but also many other 
techniques, work on sets of such individuals. Their fitness assignment processes determine 
fitness values for the individuals relative to all elements of these populations. 

Definition 1.36 (Individual). An individual p is a tuple (p.g, p.x) of an element p.g in
the search space G and the corresponding element p.x = gpm(p.g) in the problem space X.

Besides this basic individual structure, many practical realizations of optimization
algorithms use such a data structure to store additional information like the objective and
fitness values. Then, we will consider individuals as tuples in $\mathbb{G} \times \mathbb{X} \times \mathbb{Z}$, where Z is the
space of the additional information stored, $\mathbb{Z} = \mathbb{Y} \times V$, for instance. In the algorithm
definitions later in this book, we will often access the phenotypes p.x without explicitly using the
genotype-phenotype mapping, since the relation of p.x and p.g complies with Definition 1.36.

Definition 1.37 (Population). A population Pop is a list of individuals used during an 
optimization process. 

$$Pop \subseteq \mathbb{G} \times \mathbb{X} : \forall p = (p.g, p.x) \in Pop \Rightarrow p.x = gpm(p.g) \tag{1.33}$$
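
A possible realization of the individual and population structures is sketched below. The field names g, x, y, and v mirror the notation of the text, whereas the class layout, the binary decoding used as gpm, and the helper make_population are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Individual:
    g: Tuple[int, ...]                      # genotype, an element of G
    x: int                                  # phenotype, x = gpm(g)
    y: Optional[Tuple[float, ...]] = None   # objective values, filled in later
    v: Optional[float] = None               # fitness, filled in later

def gpm(g: Tuple[int, ...]) -> int:
    """Hypothetical mapping: read the bit tuple as an unsigned binary number."""
    return int("".join(map(str, g)), 2)

def make_population(genotypes, gpm: Callable) -> List[Individual]:
    """Build a population in which every p.x equals gpm(p.g) (Equation 1.33)."""
    return [Individual(g=g, x=gpm(g)) for g in genotypes]

pop = make_population([(0, 1, 1), (1, 0, 0)], gpm)
print([(p.g, p.x) for p in pop])
```

Creating individuals only through such a helper keeps the invariant p.x = gpm(p.g) intact, so later code can read p.x without calling the mapping again.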

As already mentioned, the fitness v(x) of an element x in the problem space X often does not
solely depend on the element itself. Normally, it is rather a relative measure putting the
features of x into the context of a set of solution candidates X. We denote this by writing
v(x, X). It is also possible that the fitness involves the whole individual data, including the
genotypic and phenotypic structures. We can denote this by writing v(p, Pop).

1.3.2 Fitness Landscapes and Global Optimization 

A very powerful metaphor in global optimization is the fitness landscape 37. Like many other
abstractions in optimization, fitness landscapes have been developed and extensively
researched by evolutionary biologists [2261, 1099, 775, 502]. Basically, they are visualizations
of the relationship between the genotypes or phenotypes in a given population and
their corresponding reproduction probability. The idea of such visualizations goes back to
Wright [2261], who used level contour diagrams in order to outline the effects of selection,
mutation, and crossover on the capabilities of populations to escape locally optimal
configurations. Similar abstractions arise in many other areas [1954], for instance in the physics
of disordered systems like spin-glasses [208, 1402].

36 http://en.wikipedia.org/wiki/Fitness_%28biology%29 [accessed 2008-02-22]

37 http://en.wikipedia.org/wiki/Fitness_landscape [accessed 2007-07-03]

In Chapter 2, we will discuss evolutionary algorithms, which are optimization methods 
inspired by natural evolution. The evolutionary algorithm research community has widely 
adopted the fitness landscapes as relation between individuals and their objective values 
[1431, 623]. Langdon and Poli [1242] 38 explain that fitness landscapes can be imagined as
a view on a countryside from far above. The height of each point is then analogous to its 
objective value. An optimizer can then be considered as a short-sighted hiker who tries to 
find the lowest valley or the highest hilltop. Starting from a random point on the map, she 
wants to reach this goal by walking the minimum distance. 

As already mentioned, evolutionary algorithms were first developed as single-objective 
optimization methods. Then, the objective values were directly used as fitness and the 
"reproduction probability", i. e., the chance of a solution candidate for being subject of 
further investigation, was proportional to them. In multi-objective optimization applications 
with more sophisticated fitness assignment and selection processes, this simple approach does 
not reflect the biological metaphor correctly anymore. 

In the context of this book, we therefore deviate from this view. Since
it would possibly be confusing for the reader if we used a different definition for fitness 
landscapes than the rest of the world, we introduce the new term problem landscape and 
keep using the term fitness landscape in the traditional manner. In Figure 1.19 on page 57, 
you can find some examples for fitness landscapes. 

Definition 1.38 (Problem Landscape). The problem landscape
$\Phi : \mathbb{X} \times \mathbb{N} \mapsto [0, 1] \subset \mathbb{R}^+$ maps all the points x in a problem space
X to the cumulative probability of reaching them until (inclusively) the $\tau$th evaluation of a
solution candidate. The problem landscape thus depends on the optimization problem and
on the algorithm applied in order to solve the problem.

$$\Phi(x, \tau) = P(x \text{ has been visited until the } \tau\text{th individual evaluation}) \quad \forall x \in \mathbb{X}, \tau \in \mathbb{N} \tag{1.34}$$

This definition of problem landscape is very similar to the performance measure defini- 
tion used by Wolpert and Macready [2244, 2245] in their No Free Lunch Theorem which 
will be discussed in Section 1.4.10 on page 76. In our understanding, problem landscapes 
are not only closer to the original meaning of fitness landscapes in biology, they also have 
another advantage. According to this definition, all entities involved in an optimization pro- 
cess directly influence the problem landscape. The choice of the search operations in the 
search space G, the way the initial elements are picked, the genotype-phenotype mapping, 
the objective functions, the fitness assignment process, and the way individuals are selected 
for further exploration all have impact on $\Phi$. We can furthermore make the following
assumptions about $\Phi(x, \tau)$, since it is basically some form of cumulative distribution function
(see Definition 28.18 on page 470).

$$\Phi(x, \tau_1) \le \Phi(x, \tau_2) \quad \forall \tau_1 < \tau_2 \land x \in \mathbb{X}, \; \tau_1, \tau_2 \in \mathbb{N} \tag{1.35}$$

$$0 \le \Phi(x, \tau) \le 1 \quad \forall x \in \mathbb{X}, \tau \in \mathbb{N} \tag{1.36}$$

Referring back to Definition 1.34, we can now also define what optimization algorithms
are.

Definition 1.39 (Optimization Algorithm). An optimization algorithm is a transformation
$(\mathbb{X}, F, \mathbb{G}, Op, gpm) \mapsto \Phi$ of an optimization problem (X, F, G, Op, gpm) to a problem
landscape $\Phi$ that will find at least one local optimum $x^\star_l$ for each optimization problem
(X, F, G, Op, gpm) with a weakly complete set of search operations Op and a surjective
genotype-phenotype mapping gpm if granted infinite processing time and if such an
optimum exists (see Equation 1.37).

$$\exists x^\star_l \in \mathbb{X} : \lim_{\tau \to \infty} \Phi(x^\star_l, \tau) = 1 \tag{1.37}$$

38 This part of [1242] is also online available at
http://www.cs.ucl.ac.uk/staff/W.Langdon/FOGP/intro_pic/landscape.html [accessed 2008-02-15].

An optimization algorithm is characterized by 

1. the way it assigns fitness to the individuals, 

2. the way it selects them for further investigation,

3. the way it applies the search operations, and 

4. the way it builds and treats its state information. 

The first condition in Definition 1.39, the (weak) completeness of Op, is mandatory because the
search space G cannot be explored fully otherwise. If the genotype-phenotype mapping gpm
is not surjective, there exist points in the problem space X which can never be evaluated.
Only if both conditions hold is it guaranteed that an optimization algorithm can find at
least one local optimum.

The best optimization algorithm for a given problem (X, F, G, Op, gpm) is the one with
the highest values of $\Phi(x^\star, \tau)$ for the optimal elements $x^\star$ in the problem space and for the
lowest values of $\tau$. It may be interesting that this train of thought indicates that finding
the best optimization algorithm for a given optimization problem is, itself, a multi-objective
optimization problem.

Definition 1.40 (Global Optimization Algorithm). Global optimization algorithms
are optimization algorithms that employ measures that prevent convergence to local optima
and increase the probability of finding a global optimum.

For a perfect global optimization algorithm (given an optimization problem with weakly 
complete search operations and a surjective genotype-phenotype mapping), Equation 1.38 
would hold. In reality, it can be considered questionable whether such an algorithm can 
actually be built. 

$$\forall x_1, x_2 \in \mathbb{X} : x_1 \succ x_2 \Rightarrow \lim_{\tau \to \infty} \Phi(x_1, \tau) > \lim_{\tau \to \infty} \Phi(x_2, \tau) \tag{1.38}$$




Figure 1.16: An example optimization problem. 



Let us now give a simple example for problem landscapes and how they are influenced by
the optimization algorithm applied to them. Figure 1.16 illustrates one objective function,
defined over a finite subset X of the two-dimensional real plane, which we are going to
optimize. We use the problem space X also as search space G, so we do not need
a genotype-phenotype mapping. For optimization, we will use a very simple hill climbing
algorithm 39, which initially randomly creates one solution candidate uniformly distributed
in X. In each iteration, it creates a new solution candidate from the known one using a
unary search operation. The old and the new candidate are compared, and the better one
is kept. Hence, we do not need to differentiate between fitness and objective values. In the
example, better means having lower fitness. In Figure 1.16, we can spot one local optimum
$x^\star_l$ and one global optimum $x^\star$. Between them, there is a hill, an area of very bad fitness.
The rest of the problem space exhibits a small gradient in the direction of its center. The
optimization algorithm will likely follow this gradient and sooner or later discover $x^\star_l$ or $x^\star$.
The chances of $x^\star_l$ are higher, since it is closer to the center of X.

With this setting, we have recorded the traces of two experiments with 1.3 million runs
of the optimizer (8000 iterations each). From these records, we can approximate the problem
landscapes very well.

In the first experiment, depicted in Figure 1.17, we used a search operation
$\mathrm{searchOp}_1 : \mathbb{X} \mapsto \mathbb{X}$ which creates a new solution candidate normally distributed
around the old one (see Equation 1.39). In all experiments, we divided X into a regular lattice.
With $\mathrm{searchOp}_2 : \mathbb{X} \mapsto \mathbb{X}$, used in the second experiment, the new solution candidates
are direct neighbors of the old ones in this lattice (see Equation 1.40). The problem landscape
$\Phi$ produced by this operator is shown in Figure 1.18. Both operators are complete, since each
point in the search space can be reached from each other point by applying them.



In both experiments, the probabilities of the elements of the search space being
discovered are very low, near zero, in the first few iterations. To be precise, since our
problem space is a 36 x 36 lattice, this probability is $1/36^2$ in the first iteration. Starting with
the tenth or so iteration, small peaks begin to form around the places where the optima are
located. These peaks then grow.

As already mentioned, this idea of problem landscapes and optimization reflects
solely the author's views. Notice also that it is not always possible to define problem
landscapes for problem spaces which are uncountably infinitely large.

Since the local optimum $x^\star_l$ resides at the center of the large basin and the gradient points
more directly toward it, it has a higher probability of being found than the global optimum $x^\star$.
The difference between the two search operators tested becomes obvious starting with
approximately the 2000th iteration. In the hill climber with the operator utilizing the normal
distribution, the $\Phi$ value of the global optimum begins to rise farther and farther, finally
surpassing the one of the local optimum. Even if the optimizer gets trapped in the local optimum,
it will still eventually discover the global optimum, and if we had run this experiment longer,
the according probability would have converged to 1. The reason for this is that with the normal
distribution, all points in the search space have a non-zero probability of being found from all
other points in the search space. In other words, all elements of the search space are adjacent.

The operator based on the uniform distribution is only able to create points in the direct
neighborhood of the known points. Hence, if an optimizer gets trapped in the local optimum,
it can never escape. If it arrives at the global optimum, it will never discover the local one.
In Fig. 1.18.l, we can see that $\Phi(x^\star_l, 8000) \approx 0.7$ and $\Phi(x^\star, 8000) \approx 0.3$. One of the two
points will be the result of the optimization process.

From the example we can draw four conclusions:

1. Optimization algorithms discover good elements with higher probability than elements
with bad characteristics. Well, this is what they should do.

2. The success of optimization depends very much on the way the search is conducted.

3. It also depends on the time (or the number of iterations) the optimizer is allowed to use.

4. Hill climbing algorithms are no global optimization algorithms since they have no means
of preventing getting stuck at local optima.

$$\mathrm{searchOp}_1(x) = (x_1 + \mathrm{random}_n(), \; x_2 + \mathrm{random}_n()) \tag{1.39}$$

$$\mathrm{searchOp}_2(x) = (x_1 + \mathrm{random}_u(-1, 1), \; x_2 + \mathrm{random}_u(-1, 1)) \tag{1.40}$$

39 Hill climbing algorithms are discussed thoroughly in Chapter 10.

Figure 1.17: The problem landscape of the example problem derived with $\mathrm{searchOp}_1$.
(Recoverable panel titles: Fig. 1.17.b: $\Phi(x, 2)$; Fig. 1.17.c: $\Phi(x, 5)$; Fig. 1.17.h: $\Phi(x, 1000)$;
Fig. 1.17.i: $\Phi(x, 2000)$; Fig. 1.17.j: $\Phi(x, 4000)$.)

Figure 1.18: The problem landscape of the example problem derived with $\mathrm{searchOp}_2$.
(Recoverable panel titles: Fig. 1.18.e: $\Phi(x, 50)$; Fig. 1.18.f: $\Phi(x, 100)$.)
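
The experiment can be mimicked on a much smaller scale. The sketch below is not the setup used for Figures 1.17 and 1.18: it assumes a made-up one-dimensional objective on a lattice of 20 points, a standard deviation of 3 for the normally distributed step, and 2000 runs of 200 evaluations each. It estimates the problem landscape as the share of runs in which a point has been evaluated.

```python
import random

# Hypothetical 1-D objective on the lattice 0..19 (lower is better):
# a local optimum at x = 10, the global optimum at x = 17, a hill between them.
F = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.5, 3, 6, 9, 8, 4, 1, 0, 2, 5]
N = len(F)

def neighbor_step(x, rng):
    """Move to a direct lattice neighbor (clipped to the lattice)."""
    return min(N - 1, max(0, x + rng.choice((-1, 1))))

def normal_step(x, rng):
    """Move by a rounded, normally distributed offset (clipped to the lattice)."""
    return min(N - 1, max(0, x + round(rng.gauss(0, 3))))

def landscape(search_op, evaluations, runs, seed=0):
    """Estimate Phi(x, evaluations): share of runs that visited x so far."""
    rng = random.Random(seed)
    visits = [0] * N
    for _ in range(runs):
        x = rng.randrange(N)          # first evaluation: random start
        seen = {x}
        for _ in range(evaluations - 1):
            candidate = search_op(x, rng)
            seen.add(candidate)
            if F[candidate] <= F[x]:  # hill climbing: keep the better one
                x = candidate
        for point in seen:
            visits[point] += 1
    return [count / runs for count in visits]

phi_neighbor = landscape(neighbor_step, evaluations=200, runs=2000)
phi_normal = landscape(normal_step, evaluations=200, runs=2000)
print(round(phi_neighbor[17], 2), round(phi_normal[17], 2))
```

In this toy setting, the neighbor-based hill climber only reaches the global optimum when it happens to start in its basin, whereas the normally distributed step can still jump there from the local optimum.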

1.3.3 Gradient Descent

Definition 1.41 (Gradient). A gradient 40 of a scalar field $f : \mathbb{R}^n \mapsto \mathbb{R}$ is a vector field
which points into the direction of the greatest increase of the scalar field. It is denoted by
$\nabla f$ or $\mathrm{grad}(f)$.

Optimization algorithms depend on some form of gradient in objective or fitness space in
order to find good individuals. In most cases, the problem space X is not a vector space over
the real numbers $\mathbb{R}$, so we cannot directly differentiate the objective functions with the Nabla
operator 41 $\nabla F$. Generally, samples of the search space are used to approximate the gradient.
If we compare two elements $x_1$ and $x_2$ of the problem space and find $x_1 \succ x_2$, we can assume
that there is some sort of gradient facing downwards from $x_2$ to $x_1$. When descending this
gradient, we can hope to find an $x_3$ with $x_3 \succ x_1$ and finally the global minimum.
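
For the special case in which the problem space is a real vector space, approximating the gradient from samples can be made concrete. The sketch below uses central finite differences and a fixed step size. The objective function sphere, the step size 0.1, and the number of steps are assumptions chosen for illustration, and the approach does not carry over to problem spaces without such a structure.

```python
def estimate_gradient(f, x, h=1e-6):
    """Central finite differences: approximate each partial derivative from two samples."""
    gradient = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        gradient.append((f(forward) - f(backward)) / (2 * h))
    return gradient

def sphere(x):
    """Hypothetical objective function with its minimum at the origin."""
    return sum(xi * xi for xi in x)

x = [1.0, -2.0]
for _ in range(50):
    g = estimate_gradient(sphere, x)
    x = [xi - 0.1 * gi for xi, gi in zip(x, g)]  # step against the gradient
print(x)
```

Each step moves against the estimated gradient, so the candidate approaches the minimum at the origin.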

1.3.4 Other General Features 

There are some further common semantics and operations that are shared by most op- 
timization algorithms. Many of them, for instance, start out by randomly creating some 
initial individuals which are then refined iteratively. Optimization processes which are not 
allowed to run infinitely have to find out when to terminate. In this section we define and 
discuss general abstractions for such commonalities. 

Iterations 

Global optimization algorithms often iteratively evaluate solution candidates in order to
approach the optima. We distinguish between evaluations $\tau$ and iterations t.

Definition 1.42 (Evaluation). The value $\tau \in \mathbb{N}_0$ denotes the number of solution candidates
for which the set of objective functions F has been evaluated.

Definition 1.43 (Iteration). An iteration 42 refers to one round in a loop of an algorithm.
It is one repetition of a specific sequence of instructions inside an algorithm.

Algorithms are referred to as iterative if most of their work is done by cyclic repetition
of one main loop. In the context of this book, an iterative optimization algorithm starts
with the first step t = 0. The value $t \in \mathbb{N}_0$ is the index of the iteration currently performed
by the algorithm and t + 1 refers to the following step. One example of an iterative algorithm
is Algorithm 1.1. In some optimization algorithms like genetic algorithms, for instance,
iterations are referred to as generations.

There often exists a well-defined relation between the number of performed solution
candidate evaluations τ and the index of the current iteration t in an optimization process:
Many global optimization algorithms generate and evaluate a certain number of individuals 
per generation. 

40 http://en.wikipedia.org/wiki/Gradient [accessed 2007-11-06] 

41 http://en.wikipedia.org/wiki/Del [accessed 2008-02-15] 

42 http://en.wikipedia.org/wiki/Iteration [accessed 2007-07-03] 
Termination Criterion

The termination criterion terminationCriterion() is a function with access to all the information
accumulated by an optimization process, including the number of performed steps
t, the objective values of the best individuals, and the time elapsed since the start of the
process. With terminationCriterion(), the optimizers determine when they have to halt.

Definition 1.44 (Termination Criterion). When the termination criterion function
terminationCriterion() ∈ {true, false} evaluates to true, the optimization process will
stop and return its results.

Some possible criteria that can be used to decide whether an optimizer should terminate 
or not are [1975, 1634, 2325, 2326]: 

1. The user may grant the optimization algorithm a maximum computation time. If this
time has been exceeded, the optimizer should stop. Here we should note that the time
needed for single individuals may vary, and so will the times needed for iterations. Hence,
this time threshold can sometimes not be met exactly.

2. Instead of specifying a time limit, a total number of iterations t or individual evaluations
τ may be specified. Such criteria are most interesting for the researcher, since she often
wants to know whether a qualitatively interesting solution can be found for a given
problem using at most a predefined number of samples from the problem space.

3. An optimization process may be stopped when no improvement in the solution quality 
could be detected for a specified number of iterations. Then, the process most probably 
has converged to a (hopefully good) solution and will most likely not be able to make 
further progress. 

4. If we optimize something like a decision maker or classifier based on a sample data set, 
we will normally divide this data into a training and a test set. The training set is used 
to guide the optimization process whereas the test set is used to verify its results. We can 
compare the performance of our solution when fed with the training set to its properties 
if fed with the test set. This comparison helps us detect when most probably no further 
generalization can be achieved by the optimizer and we should terminate the process. 

5. Obviously, we can terminate an optimization process if it has already yielded a suffi- 
ciently good solution. 

In practical applications, we can apply any combination of the criteria above in order to 
determine when to halt. How the termination criterion is tested in an iterative algorithm is 
illustrated in Algorithm 1.1. 



Algorithm 1.1: Example Iterative Algorithm

Input: [implicit] terminationCriterion(): the termination criterion
Data: t: the iteration counter

begin
    t <- 0
    // initialize the data of the algorithm
    while ¬terminationCriterion() do
        // perform one iteration - here happens the magic
        t <- t + 1
end
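
As a minimal sketch, the loop of Algorithm 1.1 and some of the criteria listed above could be combined in Python as shown here. The function names, the parameters (max_seconds, max_iterations, stall_limit, target), and the simple random local search used as loop body are assumptions made for illustration only; they are not part of the algorithm definition. The sketch also counts evaluations τ (tau) separately from iterations t.

```python
import random
import time


def make_termination_criterion(max_seconds, max_iterations, stall_limit, target):
    """Build terminationCriterion() as a closure over assumed (hypothetical) limits."""
    start = time.monotonic()

    def termination_criterion(t, best_value, last_improvement):
        return (
            time.monotonic() - start >= max_seconds   # criterion 1: time budget
            or t >= max_iterations                    # criterion 2: iteration budget
            or t - last_improvement >= stall_limit    # criterion 3: no recent improvement
            or best_value <= target                   # criterion 5: good enough
        )

    return termination_criterion


def iterative_minimize(f, x0, step, should_stop, rng):
    """Algorithm 1.1 with a simple random local search as the loop body."""
    t = 0                        # iteration counter
    tau = 1                      # evaluation counter (one evaluation for x0)
    best_x, best_value = x0, f(x0)
    last_improvement = 0
    while not should_stop(t, best_value, last_improvement):
        candidate = best_x + rng.uniform(-step, step)
        value = f(candidate)
        tau += 1
        if value < best_value:
            best_x, best_value = candidate, value
            last_improvement = t
        t += 1
    return best_x, best_value, t, tau


stop = make_termination_criterion(max_seconds=1.0, max_iterations=10_000,
                                  stall_limit=500, target=1e-9)
x, fx, t, tau = iterative_minimize(lambda v: v * v, 5.0, 0.5, stop, random.Random(1))
```

Because exactly one solution candidate is evaluated per iteration in this sketch, tau equals t + 1 when the loop ends; population-based methods would evaluate several individuals per generation instead.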
Minimization

Many optimization algorithms have been developed for single-objective optimization in their
original form. Such algorithms may be used for both minimization and maximization. Without
loss of generality we will present them as minimization processes since this is the most
commonly used notation. An algorithm that maximizes the function f may be transformed
to a minimization using -f instead.

Note that using the prevalence comparisons as introduced in Section 1.2.4 on page 38,
multi-objective optimization processes can be transformed into single-objective minimization
processes. Therefore x1 ≻ x2 ⇔ cmpF(x1, x2) < 0.
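
The transformation from maximization to minimization can be sketched in a few lines of Python. The names as_minimization, cmp_f, and profit are hypothetical, and the comparator shown covers only the single-objective case, not the general prevalence relation of Section 1.2.4.

```python
def as_minimization(f):
    """Wrap a function to be maximized so that a minimizer can work on -f."""
    return lambda x: -f(x)


def cmp_f(f, x1, x2):
    """Single-objective comparator: negative if x1 is better (smaller f) than x2."""
    a, b = f(x1), f(x2)
    return (a > b) - (a < b)


def profit(x):
    # Hypothetical function to be maximized; its maximum lies at x = 3.
    return -(x - 3) ** 2 + 10


cost = as_minimization(profit)
candidates = [0, 1, 2, 3, 4, 5]
best = min(candidates, key=cost)  # the minimizer of -profit is the maximizer of profit
```

Here cmp_f(cost, 3, 0) is negative because the candidate 3 has the smaller cost, which mirrors the relation x1 ≻ x2 ⇔ cmpF(x1, x2) < 0 for a single objective.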

Modeling and Simulating 

While there are a lot of problems where the objective functions are mathematical expressions 
that can directly be computed, there exist problem classes far away from such simple function 
optimization that require complex models and simulations. 

Definition 1.45 (Model). A model 43 is an abstraction or approximation of a system that
allows us to reason and to deduce properties of the system.

Models are often simplifications or idealizations of real-world issues. They are defined by
leaving away facts that probably have only minor impact on the conclusions drawn from 
them. In the area of global optimization, we often need two types of abstractions: 

1. The models of the potential solutions shape the problem space X. Examples are 

a) programs in Genetic Programming, for example for the Artificial Ant problem, 

b) construction plans of a skyscraper, 

c) distributed algorithms represented as programs for Genetic Programming, 

d) construction plans of a turbine, 

e) circuit diagrams for logical circuits, and so on. 

2. Models of the environment in which we can test and explore the properties of the po- 
tential solutions, like 

a) a map on which the Artificial Ant will move which is driven by the evolved program, 

b) an abstraction from the environment in which the skyscraper will be built, with wind 
blowing from several directions, 

c) a model of the network in which the evolved distributed algorithms can run, 

d) a physical model of air which blows through the turbine, 

e) a model of the energy source and the other pins which will be attached to the circuit,
together with the possible voltages on these pins.

Models themselves are rather static structures of descriptions and formulas. Deriving 
concrete results (objective values) from them is often complicated. It often makes more 
sense to bring the construction plan of a skyscraper to life in a simulation. Then we can test 
the influence of various wind strengths and directions on building structure and approximate 
the properties which define the objective values. 

Definition 1.46 (Simulation). A simulation 44 is the computational realization of a model. 
Whereas a model describes abstract connections between the properties of a system, a sim- 
ulation realizes these connections. 

Simulations are executable, live representations of models that can be as meaningful as
real experiments. They allow us to reason about whether a model makes sense and how certain
objects behave in the context of a model.

43 http://en.wikipedia.org/wiki/Model_%28abstract%29 [accessed 2007-07-03]

1.4 Problems in Optimization



1.4.1 Introduction 

The classification of optimization algorithms in Section 1.1.1 and the table of contents of this
book enumerate a wide variety of optimization algorithms. Yet, the approaches introduced
here represent only a small fraction of the actual number of available methods. It is justified
to ask why there are so many different approaches and why this variety is needed. One
possible answer is simply that there are so many different kinds of optimization tasks.
Each of them puts different obstacles into the way of the optimizers and comes with its own
characteristic difficulties.

In this chapter we want to discuss the most important of these complications, the major
problems that may be encountered during optimization. Some of the subjects in the following
text concern global optimization in general (multi-modality and overfitting, for instance),
others apply especially to nature-inspired approaches like genetic algorithms (epistasis and
neutrality, for example). Neglecting even a single one of them during the design or process of
optimization can render all the effort invested useless, even if highly efficient optimization
techniques are applied. By giving clear definitions and comprehensive introductions to
these topics, we want to raise the awareness of scientists and practitioners in the industry
and hope to help them to use optimization algorithms more efficiently.

In Figure 1.19, we have sketched a set of different types of fitness landscapes (see Sec- 
tion 1.3.2) which we are going to discuss. The objective values in the figure are subject to 
minimization and the small bubbles represent solution candidates under investigation. An 
arrow from one bubble to another means that the second individual is found by applying 
one search operation to the first one. 

The Term "Difficult" 

Before we go more into detail about what makes these landscapes difficult, we should establish
the term in the context of optimization. The degree of difficulty of solving a certain
problem with a dedicated algorithm is closely related to its computational complexity 45, i.e.,
the amount of resources such as time and memory required to do so. The computational complexity
depends on the number of input elements needed for applying the algorithm. This
dependency is often expressed in form of approximate boundaries with the Big-O-family
notations introduced by Bachmann [96] and made popular by Landau [1236]. Problems can
further be divided into complexity classes. One of the most difficult complexity classes owing
to its resource requirements is NP, the set of all decision problems which are solvable
in polynomial time by non-deterministic Turing machines [773]. Although many attempts
have been made, no algorithm has been found which is able to solve an NP-complete
[773] problem in polynomial time on a deterministic computer. One approach to obtaining
near-optimal solutions for problems in NP in reasonable time is to apply metaheuristic,
randomized optimization procedures.

As already stated, optimization algorithms are guided by objective functions. A function 
is difficult from a mathematical perspective in this context if it is not continuous, not 
differentiable, or if it has multiple maxima and minima. This understanding of difficulty 
comes very close to the intuitive sketches in Figure 1.19. 

In real-world applications of metaheuristic optimization, the characteristics of
the objective functions are not known in advance. The problems are usually NP or have
unknown complexity. It is therefore only rarely possible to derive boundaries for the performance
or the runtime of optimizers in advance, let alone exact estimates with mathematical
precision.

45 see Section 30.1.3 on page 550

Figure 1.19: Different possible properties of fitness landscapes (minimization).
[Panel titles recoverable from the extraction: Fig. 1.19.c: Multimodal; Fig. 1.19.d: Rugged; Fig. 1.19.g: Needle-In-A-Haystack; Fig. 1.19.h: Nightmare]

Most often, experience, rules of thumb, and empirical results based on models obtained
from related research areas such as biology are the only guides available. In this chapter,
we discuss many such models and rules, providing a better understanding of when the
application of a metaheuristic is feasible and when it is not, as well as indicators on how to
avoid defining problems in a way that makes them difficult.

1.4.2 Premature Convergence 
Introduction 

An optimization algorithm has converged if it cannot reach new solution candidates anymore
or if it keeps on producing solution candidates from a "small" 46 subset of the problem space.
Metaheuristic global optimization algorithms will usually converge at some point in time.
In nature, a similar phenomenon can be observed according to [1196]: The niche preemption
principle states that a niche in a natural environment tends to become dominated by a single
species [1347]. One of the problems in global optimization (and basically, also in nature) is
that it is often not possible to determine whether the best solution currently known is
situated on a local or a global optimum and thus, if convergence is acceptable. In other words,
it is usually not clear whether the optimization process can be stopped, whether it should
concentrate on refining the current optimum, or whether it should examine other parts of
the search space instead. This can, of course, only become cumbersome if there are multiple
(local) optima, i.e., the problem is multimodal as depicted in Fig. 1.19.c.

A mathematical function is multimodal if it has multiple maxima or minima [1863, 2327, 
512]. A set of objective functions (or a vector function) F is multimodal if it has multiple 
(local or global) optima - depending on the definition of "optimum" in the context of the 
corresponding optimization problem. 

The Problem 

An optimization process has prematurely converged to a local optimum if it is no longer able 
to explore other parts of the search space than the area currently being examined and there 
exists another region that contains a superior solution [2075, 1824]. Figure 1.20 illustrates 
examples for premature convergence. 

The existence of multiple global optima itself is not problematic and the discovery of 
only a subset of them can still be considered as successful in many cases. The occurrence of 
numerous local optima, however, is more complicated. 

Domino Convergence 

The phenomenon of domino convergence has been brought to attention by Rudnick [1773]
who studied it in the context of his BinInt problem [1773, 2036] which is discussed in Section 21.2.5.
In principle, domino convergence occurs when the solution candidates have features
which contribute to significantly different degrees to the total fitness. If these features
are encoded in separate genes (or building blocks) in the genotypes, they are likely to be 
treated with different priorities, at least in randomized or heuristic optimization methods. 

Building blocks with a very strong positive influence on the objective values, for instance,
will quickly be adopted by the optimization process (i.e., "converge"). During this time, the
alleles of genes with a smaller contribution are ignored. They do not come into play until
the optimal alleles of the more "important" blocks have been accumulated. Rudnick [1773]
called this sequential convergence phenomenon domino convergence due to its resemblance
to a row of falling domino stones [2036].

46 according to a suitable metric like the number of modifications or mutations which need to be
applied to a given solution in order to leave this subset

Figure 1.20: Premature convergence in the objective space.
[Panel titles: Fig. 1.20.a: Example 1: Maximization; Fig. 1.20.b: Example 2: Minimization]

Let us consider the application of a genetic algorithm in such a scenario. Mutation 
operators from time to time destroy building blocks with strong positive influence which 
are then reconstructed by the search. If this happens with a high enough frequency, the 
optimization process will never get to optimize the less salient blocks because repairing
and rediscovering those with higher importance takes precedence. Thus, the mutation rate 
of the EA limits the probability of finding the global optima in such a situation. 

In the worst case, the contributions of the less salient genes may almost look like noise and 
they are not optimized at all. Such a situation is also an instance of premature convergence, 
since the global optimum which would involve optimal configurations of all building blocks 
will not be discovered. In this situation, restarting the optimization process will not help 
because it will always turn out the same way. Example problems which are likely to
exhibit domino convergence are the Royal Road and the aforementioned BinInt problem,
which you can find discussed in Section 21.2.4 and Section 21.2.5, respectively.
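
To illustrate how strongly the contributions of single genes can differ, consider a small Python sketch. It assumes that BinInt assigns to a bit string its value as a binary integer (maximization, most significant bit first), which is how the problem is commonly stated; the function name bin_int is ours, and Section 21.2.5 remains the reference for the exact definition.

```python
def bin_int(bits):
    """Integer value of a bit string, most significant bit first (maximization)."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


n = 8
only_first = [1] + [0] * (n - 1)      # only the most salient gene is set correctly
all_but_first = [0] + [1] * (n - 1)   # every gene except the most salient one is correct
```

Under this assumption, the first gene alone contributes more than all remaining genes together (128 versus 127 for n = 8), so selection will settle the leading bits first while the trailing bits behave almost like noise.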

One Cause: Loss of Diversity 

In biology, diversity is the variety and abundance of organisms at a given place and time 
[1598, 1348]. Much of the beauty and efficiency of natural ecosystems is based on a dazzling 
array of species interacting in manifold ways. Diversification is also a good strategy utilized 
by investors in the economy in order to increase their wealth. 

In population-based global optimization algorithms, maintaining a set of diverse solution 
candidates is very important as well. Losing diversity means approaching a state where all 
the solution candidates under investigation are similar to each other. Another term for this 
state is convergence. Discussions about how diversity can be measured have been provided 
by Routledge [1771], Cousins [459], Magurran [1348], Morrison and De Jong [1462], Paenke 
et al. [1598], and Burke et al. [309, 311]. 

Preserving diversity is directly linked with maintaining a good balance between exploita- 
tion and exploration [1598] and has been studied by researchers from many domains, such 
as 

1. Genetic Algorithms [1558, 1750, 1751], 

2. Evolutionary Algorithms [253, 254, 1262, 1471, 1943, 1892], 
3. Genetic Programming [510, 871, 872, 310, 311, 273],

4. Tabu Search [812, 816], and 

5. Particle Swarm Optimization [2226]. 

Exploration vs. Exploitation 

The operations which create new solutions from existing ones have a very large impact on 
the speed of convergence and the diversity of the populations [637, 1910]. The step size in 
Evolution Strategy is a good example of this issue: setting it properly is very important and
leads to the "exploration versus exploitation" problem [940] which can be observed in
other areas of global optimization as well. 47

In the context of optimization, exploration means finding new points in areas of the 
search space which have not been investigated before. Since computers have only limited 
memory, already evaluated solution candidates usually have to be discarded in order to 
accommodate the new ones. Exploration is a metaphor for the procedure which allows search 
operations to find novel and maybe better solution structures. Such operators (like mutation 
in evolutionary algorithms) have a high chance of creating inferior solutions by destroying 
good building blocks but also a small chance of finding totally new, superior traits (which, 
however, is not guaranteed at all). 

Exploitation, on the other hand, is the process of improving and combining the traits of 
the currently known solutions, as done by the crossover operator in evolutionary algorithms, 
for instance. Exploitation operations often incorporate small changes into already tested 
individuals leading to new, very similar solution candidates or try to merge building blocks 
of different, promising individuals. They usually have the disadvantage that other, possibly 
better, solutions located in distant areas of the problem space will not be discovered. 

Almost all components of optimization strategies can either be used for increasing exploitation
or in favor of exploration. Unary search operations that improve an existing solution
in small steps can often be built, which makes them exploitation operators. They can also
be implemented in a way that introduces much randomness into the individuals, effectively
making them exploration operators. Selection operations 48 in Evolutionary Computation
choose a set of the most promising solution candidates which will be investigated in the
next iteration of the optimizers. They can either return a small group of best individuals
(exploitation) or a wide range of existing solution candidates (exploration).

Optimization algorithms that favor exploitation over exploration have higher convergence 
speed but run the risk of not finding the optimal solution and may get stuck at a local 
optimum. Then again, algorithms which perform excessive exploration may never improve 
their solution candidates well enough to find the global optimum or it may take them 
very long to discover it "by accident". A good example for this dilemma is the Simulated 
Annealing algorithm discussed in Chapter 12 on page 263. It is often modified to a form called 
simulated quenching which focuses on exploitation but loses the guaranteed convergence to 
the optimum. Generally, optimization algorithms should employ at least one search operation 
of explorative character and at least one which is able to exploit good solutions further. There 
exists a vast body of research on the trade-off between exploration and exploitation that 
optimization algorithms have to face [638, 945, 622, 1494, 49, 538]. 

Countermeasures 

There is no general approach which can prevent premature convergence. The probability 
that an optimization process gets caught in a local optimum depends on the characteristics 
of the problem to be solved and the parameter settings and features of the optimization 
algorithms applied [2051, 1775]. 

47 More or less synonymously to exploitation and exploration, the terms intensification and
diversification have been introduced by Glover [812, 816] in the context of Tabu Search.

48 Selection will be discussed in Section 2.4 on page 121. 
A very crude and yet sometimes effective measure is restarting the optimization process
at randomly chosen points in time. One example for this method is GRASPs, Greedy
Randomized Adaptive Search Procedures [663, 652] (see Section 10.6 on page 256), which con- 
tinuously restart the process of creating an initial solution and refining it with local search. 
Still, such approaches are likely to fail in domino convergence situations. Increasing the 
proportion of exploration operations may also reduce the chance of premature convergence. 
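
The restart idea can be sketched as follows. This is not GRASP itself but a deliberately simple stand-in: a hill climber is restarted from uniformly random points after a fixed iteration budget, which simplifies the "randomly chosen points in time" mentioned above. The objective function multimodal, the interval, the step size, and all other parameter values are assumptions chosen for illustration.

```python
import math
import random


def hill_climb(f, x, step, iterations, rng):
    """Plain local search: accept a neighbor only if it improves f."""
    fx = f(x)
    for _ in range(iterations):
        y = x + rng.uniform(-step, step)
        fy = f(y)
        if fy < fx:
            x, fx = y, fy
    return x, fx


def restarted_hill_climb(f, low, high, restarts, step, iterations, rng):
    """Restart the local search from random points and keep the best result."""
    best_x, best_fx = None, math.inf
    for _ in range(restarts):
        x, fx = hill_climb(f, rng.uniform(low, high), step, iterations, rng)
        if fx < best_fx:
            best_x, best_fx = x, fx
    return best_x, best_fx


def multimodal(x):
    # Hypothetical multimodal objective with its global minimum f(0) = 0.
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))


rng = random.Random(42)
_, single = hill_climb(multimodal, 4.5, 0.1, 500, rng)
_, multi = restarted_hill_climb(multimodal, -5.0, 5.0, 30, 0.1, 500, rng)
```

The single run started at x = 4.5 cannot leave the basin of the local optimum near x = 4 or x = 5 because its step size is too small to cross the neighboring ridge, whereas the restarted variant samples many basins and keeps the best local optimum it encounters.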

In order to extend the duration of the evolution in evolutionary algorithms, many meth- 
ods have been devised for steering the search away from areas which have already been 
frequently sampled. This can be achieved by integrating density metrics into the fitness 
assignment process. The most popular of such approaches are sharing and niching (see Section 2.3.4).
The Strength Pareto Algorithms, which are widely accepted to be highly efficient,
use another idea: they adapt the number of individuals that one solution candidate dominates
as a density measure [2329, 2332]. One very simple method aiming for convergence
prevention is introduced in Section 2.4.8. Using low selection pressure furthermore decreases 
the chance of premature convergence but also decreases the speed with which good solutions 
are exploited. 

Another approach against premature convergence is to introduce the capability of self- 
adaptation, allowing the optimization algorithm to change its strategies or to modify its 
parameters depending on its current state. Such behaviors, however, are often implemented 
not in order to prevent premature convergence but to speed up the optimization process 
(which may lead to premature convergence to local optima) [1776, 1777, 1778]. 

1.4.3 Ruggedness and Weak Causality 



The Problem: Ruggedness 

Optimization algorithms generally depend on some form of gradient in the objective or 
fitness space. The objective functions should be continuous and exhibit low total variation 49 , 
so the optimizer can descend the gradient easily. If the objective functions are unsteady 
or fluctuating, i. e., going up and down, it becomes more complicated for the optimization 
process to find the right directions to proceed to. The more rugged a function gets, the harder
it becomes to optimize it. In short, one could say ruggedness is multi-modality plus steep
ascents and descents in the fitness landscape. Examples of rugged landscapes are Kauffman's
NK fitness landscape (see Section 21.2.1), the p-Spin model discussed in Section 21.2.2, 
Bergman and Feldman's jagged fitness landscape [182], and the sketch in Fig. 1.19.d on 
page 57. 

One Cause: Weak Causality 

During an optimization process, new points in the search space are created by the search 
operations. Generally we can assume that the genotypes which are the input of the search 
operations correspond to phenotypes which have previously been selected. Usually, the better 
or the more promising an individual is, the higher are its chances of being selected for further 
investigation. Reversing this statement suggests that individuals which are passed to the 
search operations are likely to have a good fitness. Since the fitness of a solution candidate 
depends on its properties, it can be assumed that the features of these individuals are not so 
bad either. It should thus be possible for the optimizer to introduce slight changes to their 
properties in order to find out whether they can be improved any further 50. Normally, such
exploitative modifications should also lead to small changes in the objective values and hence,
in the fitness of the solution candidate.

Definition 1.47 (Strong Causality). Strong causality (locality) means that small 
changes in the properties of an object also lead to small changes in its behavior [1713, 
1714, 1759]. 

This principle (proposed by Rechenberg [1713, 1714]) should not only hold for the search 
spaces and operations designed for optimization, but applies to natural genomes as well. The 
offspring resulting from sexual reproduction of two fish, for instance, has a different genotype 
than its parents. Yet, it is far more probable that these variations manifest in a unique color 
pattern of the scales, for example, instead of leading to a totally different creature. 

Apart from this straightforward, informal explanation here, causality has been investi- 
gated thoroughly in different fields of optimization, such as Evolution Strategy [1713, 597], 
structure evolution [1303, 1302], Genetic Programming [1758, 1759, 1007, 597], genotype- 
phenotype mappings [1854], search operators [597], and evolutionary algorithms in general 
[1955, 1765, 597]. 

In fitness landscapes with weak (low) causality, small changes in the solution candidates
often lead to large changes in the objective values, i.e., ruggedness. It then becomes harder
to decide which region of the problem space to explore and the optimizer cannot find reliable 
gradient information to follow. A small modification of a very bad solution candidate may 
then lead to a new local optimum and the best solution candidate currently known may be 
surrounded by points that are inferior to all other tested individuals. 

The lower the causality of an optimization problem, the more rugged its fitness landscape 
is, which leads to a degeneration of the performance of the optimizer [1168]. This does not 
necessarily mean that it is impossible to find good solutions, but it may take very long to 
do so. 

Fitness Landscape Measures 

As measures for the ruggedness of a fitness landscape (or its general difficulty), many
different metrics have been proposed. Wedge and Kell [2164] and Altenberg [45] provide
nice lists of them in their work 51, which we summarize here:

• Weinberger [2169] introduced the autocorrelation function and the correlation length of 
random walks. 

• The correlation of the search operators was used by Manderick et al. [1354] in conjunction
with the autocorrelation.

• Jones and Forrest [1070, 1069] proposed the fitness distance correlation (FDC), the corre- 
lation of the fitness of an individual and its distance to the global optimum. This measure 
has been extended by researchers such as Clergue et al. [416, 2103]. 

• The probability that search operations create offspring fitter than their parents, as defined 
by Rechenberg [1713] and Beyer [196] (and called evolvability by Altenberg [42]), will be 
discussed in Section 1.4.5 on page 65 in depth. 

• Simulation dynamics have been researched by Altenberg [42] and Grefenstette [855]. 

• Another interesting metric is the fitness variance of formae (Radcliffe and Surry [1695]) 
and schemas (Reeves and Wright [1717]). 

• The error threshold method from theoretical biology [625, 1552] has been adopted by Ochoa
et al. [1557] for evolutionary algorithms. It is the "critical mutation rate beyond which
structures obtained by the evolutionary process are destroyed by mutation more frequently
than selection can reproduce them" [1557].

50 We have already mentioned this under the subject of exploitation.
51 Especially the one of Wedge and Kell [2164] is beautiful and far more detailed than this summary 
here. 
• The negative slope coefficient (NSC) by Vanneschi et al. [2104, 2105] may be considered
as an extension of Altenberg's evolvability measure.

• Davidor [489] uses the epistatic variance as a measure of utility of a certain representation
in genetic algorithms. We discuss the issue of epistasis in Section 1.4.6.

• The genotype-fitness correlation (GFC) of Wedge and Kell [2164] is a new measure for
ruggedness in fitness landscapes and has been shown to be a good guide for determining
optimal population sizes in Genetic Programming.
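
As a hedged illustration of one of the measures listed above, the fitness distance correlation can be estimated from a sample as the Pearson correlation between objective values and distances to a known global optimum. The sketch below assumes bit-string genotypes, the Hamming distance, and a hypothetical objective zeros; none of these choices is prescribed by the cited works, and the measure obviously requires that the optimum is known in advance.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def hamming(a, b):
    """Number of positions in which two bit strings differ."""
    return sum(u != v for u, v in zip(a, b))


def fitness_distance_correlation(f, optimum, samples):
    """Correlate objective values with the distance to a known global optimum."""
    return pearson([f(x) for x in samples], [hamming(x, optimum) for x in samples])


n = 20
optimum = (1,) * n


def zeros(x):
    # Hypothetical objective (minimization): the number of zero bits.
    return n - sum(x)


rng = random.Random(0)
samples = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(200)]
fdc = fitness_distance_correlation(zeros, optimum, samples)
```

For this easy objective, zeros(x) equals the Hamming distance to the optimum, so the correlation is 1 up to rounding. Under minimization, values close to 1 indicate that objective values shrink as the optimum is approached, while values near 0 or below hint at rugged or deceptive landscapes.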

Autocorrelation and Correlation Length 

As an example, let us take a look at the autocorrelation function as well as the correlation
length of random walks [2169]. Here we borrow its definition from Verel et al. [2114]:

Definition 1.48 (Autocorrelation Function). Given a random walk (x_i, x_{i+1}, ...), the
autocorrelation function ρ of an objective function f is the autocorrelation function of the
time series (f(x_i), f(x_{i+1}), ...).

\[ \rho(k, f) = \frac{E[f(x_i)\, f(x_{i+k})] - E[f(x_i)]\, E[f(x_{i+k})]}{D^2[f(x_i)]} \qquad (1.41) \]

where E[f(x_i)] and D^2[f(x_i)] are the expected value and the variance of f(x_i).

The correlation length τ = -1 / log ρ(1, f) measures how the autocorrelation function decreases
and summarizes the ruggedness of the fitness landscape: the larger the correlation
length, the lower the total variation of the landscape. From the works of Kinnear, Jr. [1141]
and Lipsitch [1293], however, we also know that correlation measures do not always
fully represent the hardness of a problem landscape.
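
The following Python sketch estimates ρ(k, f) and the correlation length empirically from a single random walk. It is only an approximation of the definition above: expectations are replaced by sample averages over the walk, the walk moves through bit strings by single bit flips, and the two objective functions smooth and rugged are hypothetical examples introduced here.

```python
import math
import random


def autocorrelation(values, k):
    """Empirical rho(k, f) for the series f(x_i) recorded along a random walk."""
    n = len(values) - k
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    cov = sum((values[i] - mean) * (values[i + k] - mean) for i in range(n)) / n
    return cov / var


def correlation_length(values):
    """tau = -1 / log(rho(1, f)); only defined for 0 < rho(1, f) < 1."""
    return -1.0 / math.log(autocorrelation(values, 1))


def random_walk_values(f, n_bits, steps, rng):
    """Walk through bit strings by single bit flips and record f along the way."""
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    values = [f(x)]
    for _ in range(steps):
        x[rng.randrange(n_bits)] ^= 1
        values.append(f(x))
    return values


def smooth(x):
    # Hypothetical smooth objective: the number of one bits.
    return sum(x)


def rugged(x):
    # Hypothetical rugged objective: an unrelated pseudo-random value per bit string.
    return random.Random(hash(tuple(x))).random()


rng = random.Random(7)
smooth_values = random_walk_values(smooth, 32, 5000, rng)
rugged_values = random_walk_values(rugged, 32, 5000, rng)
rho_smooth = autocorrelation(smooth_values, 1)
rho_rugged = autocorrelation(rugged_values, 1)
tau_smooth = correlation_length(smooth_values)
```

On the smooth objective, neighboring points of the walk have almost the same value, so ρ(1, f) is close to 1 and the correlation length is large. On the rugged objective, neighboring values are unrelated, ρ(1, f) is close to 0, and the correlation length is not meaningful.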



Countermeasures 

To the knowledge of the author, no viable method which can directly mitigate the effects of 
rugged fitness landscapes exists. In population-based approaches, using large population sizes 
and applying methods to increase the diversity can reduce the influence of ruggedness, but 
only up to a certain degree. Utilizing Lamarckian evolution [522, 2215] or the Baldwin effect 
[123, 929, 930, 2215], i. e., incorporating a local search into the optimization process, may 
further help to smoothen out the fitness landscape [864] (see Section 15.2 and Section 15.3, 
respectively). 

Weak causality is often a home-made problem because it results to some extent from
the choice of the solution representation and search operations. We pointed out that exploration
operations are important for lowering the risk of premature convergence. Exploitation
operators are just as important for refining solutions to a certain degree. In order to
apply optimization algorithms in an efficient manner, it is necessary to find representations
which allow for iterative modifications with bounded influence on the objective values, i.e.,
exploitation. In Section 1.5.2, we present some further rules-of-thumb for search space and
operation design.
1.4.4 Deceptiveness
Introduction

Especially annoying fitness landscapes show deceptiveness (or deceptivity). The gradient of
deceptive objective functions leads the optimizer away from the optima, as illustrated in
Fig. 1.19.e.

The term deceptiveness is mainly used in the genetic algorithm 52 community in the 
context of the Schema Theorem. Schemas describe certain areas (hyperplanes) in the search 
space. If an optimization algorithm has discovered an area with a better average fitness 
compared to other regions, it will focus on exploring this region based on the assumption 
that highly fit areas are likely to contain the true optimum. Objective functions where this 
is not the case are called deceptive [190, 821, 1285]. Examples for deceptiveness are the ND 
fitness landscapes outlined in Section 21.2.3, trap functions (see Section 21.2.3), and the 
fully deceptive problems given by Goldberg et al. [825, 541]. 
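
The effect can be made concrete with a minimal Python sketch. It pairs a trap function with a plain single-bit-flip hill climber; the names trap and hill_climb, the start string, and the choice of hill climbing are our own assumptions for illustration and are not taken from the cited works.

```python
def trap(bits):
    """Deceptive trap function (maximization): the optimum is the all-one
    string, but every other point scores higher the fewer ones it has."""
    n = len(bits)
    ones = sum(bits)
    return n if ones == n else n - 1 - ones


def hill_climb(bits, objective):
    """Greedy single-bit-flip hill climber; stops when no flip improves."""
    current = list(bits)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            neighbor = current[:]
            neighbor[i] = 1 - neighbor[i]
            if objective(neighbor) > objective(current):
                current = neighbor
                improved = True
    return current


start = [1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical start point
result = hill_climb(start, trap)
print(result, trap(result), trap([1] * len(start)))
```

Each accepted step removes a one, so the information gathered by the climber steadily leads it away from the all-one optimum.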

The Problem 

If the information accumulated by an optimizer actually guides it away from the optimum, 
search algorithms will perform worse than a random walk or an exhaustive enumeration 
method. This issue has been known for a long time [2159, 1433, 1434, 2034] and has been 
subsumed under the No Free Lunch Theorem which we will discuss in Section 1.4.10. 

Countermeasures 

Solving deceptive optimization tasks perfectly involves sampling many individuals with very 
bad features and low fitness. This contradicts the basic ideas of metaheuristics and thus, 
there are no efficient countermeasures against deceptivity. Using large population sizes, main- 
taining a very high diversity, and utilizing linkage learning (see Section 1.4.6) are, maybe, 
the only approaches which can provide at least a small chance of finding good solutions. 

1.4.5 Neutrality and Redundancy 
The Problem: Neutrality 

Definition 1.49 (Neutrality). We consider the outcome of the application of a search 
operation to an element of the search space as neutral if it yields no change in the objective 
values [1718, 149]. 

It is challenging for optimization algorithms if the best solution candidate currently 
known is situated on a plane of the fitness landscape, i. e., all adjacent solution candidates 
have the same objective values. As illustrated in Fig. 1.19.f, an optimizer then cannot find 
any gradient information and thus, no direction in which to proceed in a systematic manner. 
From its point of view, each search operation will yield identical individuals. Furthermore, 
optimization algorithms usually maintain a list of the best individuals found, which will then 
overflow eventually or require pruning. 

The degree of neutrality v is defined as the fraction of neutral results among all possible 
products of the search operations applied to a specific genotype [149]. We can generalize 
this measure to areas of the search space G by averaging over all their elements. Regions 
where v is close to one are considered as neutral. 
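
A minimal sketch of this measure, assuming bit-string genotypes, a single-bit flip as the only search operation, and an identity genotype-phenotype mapping (all three are assumptions made for this example, as is the toy objective plateau):

```python
def degree_of_neutrality(genotype, objective):
    """Fraction of single-bit-flip offspring whose objective value
    equals that of the parent genotype."""
    base = objective(genotype)
    neutral = 0
    for i in range(len(genotype)):
        offspring = genotype[:]
        offspring[i] = 1 - offspring[i]
        if objective(offspring) == base:
            neutral += 1
    return neutral / len(genotype)


def region_neutrality(genotypes, objective):
    """Average degree of neutrality over a set of genotypes."""
    values = [degree_of_neutrality(g, objective) for g in genotypes]
    return sum(values) / len(values)


def plateau(bits):
    """Hypothetical objective in which only the first two genes matter."""
    return bits[0] + bits[1]


print(degree_of_neutrality([0, 0, 1, 1], plateau))
print(region_neutrality([[0, 0, 0, 0], [1, 1, 1, 1]], plateau))
```

For operators with more than one possible outcome per position, the loop would have to enumerate (or sample) all products of the operator instead of the n single-bit flips.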

52 We are going to discuss genetic algorithms in Chapter 3 on page 141 and the Schema Theorem 
in Section 3.6 on page 150. 
Evolvability 

Another metaphor in global optimization borrowed from biological systems is evolvability 53 
[500]. Wagner [2132, 2133] points out that this word has two uses in biology: According 
to Kirschner and Gerhart [1144], a biological system is evolvable if it is able to generate 
heritable, selectable phenotypic variations. Such properties can then be spread by natural 
selection and changed during the course of evolution. In its second sense, a system is evolvable 
if it can acquire new characteristics via genetic change that help the organism(s) to survive 
and to reproduce. Theories about how the ability of generating adaptive variants has evolved 
have been proposed by Riedl [1732], Altenberg [43], Wagner and Altenberg [2134], Bonner 
[247], and Conrad [439], amongst others. The idea of evolvability can be adopted for global 
optimization as follows: 

Definition 1.50 (Evolvability). The evolvability of an optimization process in its current 
state defines how likely the search operations will lead to solution candidates with new (and 
possibly better) objective values. 

The direct probability of success [1713, 196], i.e., the chance that search operators produce 
offspring fitter than their parents, is also sometimes referred to as evolvability in the context 
of evolutionary algorithms [45, 42]. 
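
The direct probability of success can be computed exactly for small neighborhoods. The sketch below assumes bit strings, maximization, and a mutation operator that flips one uniformly chosen bit; the function name success_probability and the use of the number of ones as objective are illustrative assumptions.

```python
def success_probability(genotype, objective):
    """Chance that flipping one uniformly chosen bit yields an offspring
    that is strictly fitter than its parent (maximization)."""
    parent_value = objective(genotype)
    better = 0
    for i in range(len(genotype)):
        child = genotype[:]
        child[i] = 1 - child[i]
        if objective(child) > parent_value:
            better += 1
    return better / len(genotype)


# Objective: number of ones in the string (the built-in sum).
print(success_probability([1, 1, 0, 0], sum))
print(success_probability([1, 1, 1, 1], sum))
```

At the optimum the probability drops to zero, and on a completely neutral plateau it is zero everywhere, which matches the low evolvability of hill climbers in such regions.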



Neutrality: Problematic and Beneficial 

The link between evolvability and neutrality has been discussed by many researchers [2300, 
2133]. The evolvability of neutral parts of a fitness landscape depends on the optimization 
algorithm used. It is especially low for hill climbing and similar approaches, since the search 
operations cannot directly provide improvements or even changes. The optimization process 
then degenerates to a random walk, as illustrated in Fig. 1.19.f on page 57. The work of 
Beaudoin et al. [161] on the ND fitness landscapes 54 shows that neutrality may "destroy" 
useful information such as correlation. 

Researchers in molecular evolution, on the other hand, found indications that the major- 
ity of mutations in biology have no selective influence [732, 980] and that the transformation 
from genotypes to phenotypes is a many-to-one mapping. Wagner [2133] states that neutral- 
ity in natural genomes is beneficial if it concerns only a subset of the properties peculiar to 
the offspring of a solution candidate while allowing meaningful modifications of the others. 
Toussaint and Igel [2050] even go as far as declaring it a necessity for self-adaptation. 

The theory of punctuated equilibria 55 in biology, introduced by Eldredge and Gould 
[630, 629], states that species experience long periods of evolutionary inactivity which are 
interrupted by sudden, localized, and rapid phenotypic evolutions [118]. It is assumed that 
the populations explore neutral layers 57 during the time of stasis until, suddenly, a relevant 
change in a genotype leads to a better adapted phenotype [2098] which then reproduces 
quickly. Similar phenomena can be observed and are utilized in EAs [426, 1365]. 

"Uh?", you may think, "How does this fit together?" The key to differentiating between 
"good" and "bad" neutrality is its degree v in relation to the number of possible solutions 
maintained by the optimization algorithms. Smith et al. [1913] have used illustrative 
examples similar to Figure 1.21 showing that a certain amount of neutral reproductions can 
foster the progress of optimization. In Fig. 1.21.a, basically the same scenario of premature 
convergence as in Fig. 1.20.a on page 59 is depicted. The optimizer is drawn to a local 
optimum from which it cannot escape anymore. Fig. 1.21.b shows that a little shot of neutrality 

53 http://en.wikipedia.org/wiki/Evolvability [accessed 2007-07-03] 

54 See Section 21.2.3 on page 333 for a detailed elaboration on the ND fitness landscape. 

55 http://en.wikipedia.org/wiki/Punctuated_equilibrium [accessed 2008-07-01] 

56 A very similar idea is utilized in the Extremal Optimization method discussed in Chapter 13. 

57 Or neutral networks, as discussed in Section 1.4.5. 
could form a bridge to the global optimum. The optimizer now has a chance to escape the 
smaller peak if it is able to find and follow that bridge, i. e., the evolvability of the system 
has increased. If this bridge gets wider, as sketched in Fig. 1.21.c, the chance of finding the 
global optimum increases as well. Of course, if the bridge gets too wide, the optimization 
process may end up in a scenario like in Fig. 1.19.f on page 57 where it cannot find any 
direction. Furthermore, in this scenario we expect the neutral bridge to lead to somewhere 
useful, which is not necessarily the case in reality. 



Figure 1.21: Possible positive influence of neutrality. (Fig. 1.21.a: Premature Convergence; 
Fig. 1.21.b: Small Neutral Bridge; Fig. 1.21.c: Wide Neutral Bridge.) 



Recently, the idea of utilizing the processes of molecular 58 and evolutionary 59 biology as a 
complement to Darwinian evolution for optimization has gained interest [144]. Scientists like Hu 
and Banzhaf [967, 968] have begun to study the application of metrics such as the evolution 
rate of gene sequences [2281, 2257] to evolutionary algorithms. Here, the degree of neutrality 
(synonymous vs. non-synonymous changes) seems to play an important role. 

Examples for neutrality in fitness landscapes are the ND family (see Section 21.2.3), the 
NKp and NKq models (discussed in Section 21.2.1), and the Royal Road (see Section 21.2.4). 
Another common instance of neutrality is bloat in Genetic Programming, which is outlined 
in Section 4.10.3 on page 224. 

Neutral Networks 

From the idea of neutral bridges between different parts of the search space as sketched by 
Smith et al. [1913], we can derive the concept of neutral networks. 

Definition 1.51 (Neutral Network). Neutral networks are equivalence classes K of el- 
ements of the search space G which map to elements of the problem space X with the same 
objective values and are connected by chains of applications of the search operators Op [149]. 

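$$ \forall g_1, g_2 \in \mathbb{G} : g_1 \in K(g_2) \subseteq \mathbb{G} \Leftrightarrow \exists k \in \mathbb{N}_0 : P\left(g_2 = \mathrm{Op}^k(g_1)\right) > 0 \,\land\, F(\mathrm{gpm}(g_1)) = F(\mathrm{gpm}(g_2)) \tag{1.44} $$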
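
For a finite search space, the neutral network of a genotype can be enumerated by a breadth-first search. The following sketch is only an illustration under our own assumptions: genotypes are bit tuples, the search operation is a single-bit flip, the genotype-phenotype mapping is the identity, and every step along a chain is itself required to be neutral. The toy objective is hypothetical.

```python
from collections import deque


def neutral_network(start, objective):
    """Set of genotypes reachable from start by single-bit flips that
    never change the objective value."""
    target = objective(start)
    seen = {tuple(start)}
    queue = deque([tuple(start)])
    while queue:
        g = queue.popleft()
        for i in range(len(g)):
            neighbor = g[:i] + (1 - g[i],) + g[i + 1:]
            if neighbor not in seen and objective(neighbor) == target:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def objective(g):
    """Hypothetical objective in which only the first gene matters."""
    return g[0]


network = neutral_network((0, 0, 0), objective)
print(sorted(network))
```

The search space of size 2^n is explored exhaustively here, so the approach is only meant for very small genomes.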

Barnett [149] states that a neutral network has the constant innovation property if 

58 http://en.wikipedia.org/wiki/Molecular_biology [accessed 2008-07-20] 

59 http://en.wikipedia.org/wiki/Evolutionary_biology [accessed 2008-07-20] 
1. the rate of discovery of innovations remains constant for a reasonably large number of 
applications of the search operations [981], and 

2. this rate is comparable with that of an unconstrained random walk. 

Networks with this property may prove very helpful if they connect the optima in the fitness 
landscape. Stewart [1962] utilizes neutral networks and the idea of punctuated equilibria 
in his extrema selection, a genetic algorithm variant that focuses on exploring individuals 
which are far away from the centroid of the set of currently investigated solution candidates 
(but have still good objective values). Then again, Barnett [148] showed that populations 
in genetic algorithms tend to dwell in neutral networks of high dimensions of neutrality 
regardless of their objective values, which (obviously) cannot be considered advantageous. 

The convergence on neutral networks has furthermore been studied by Bornberg-Bauer 
and Chan [251], van Nimwegen et al. [2097, 2096], and Wilke [2225]. Their results show that 
the topology of neutral networks strongly determines the distribution of genotypes on them. 
Generally, the genotypes are "drawn" to the solutions with the highest degree of neutrality 
v on the neutral network (Beaudoin et al. [161]). 

Redundancy: Problematic and Beneficial 

Definition 1.52 (Redundancy). Redundancy in the context of global optimization is a 
feature of the genotype-phenotype mapping and means that multiple genotypes map to the 
same phenotype, i. e., the genotype-phenotype mapping is not injective. 

$$ \exists g_1, g_2 : g_1 \neq g_2 \,\land\, \mathrm{gpm}(g_1) = \mathrm{gpm}(g_2) \tag{1.45} $$

The role of redundancy in the genome is as controversial as that of neutrality [2168]. 
There exist many accounts of its positive influence on the optimization process. Shipman 
et al. [1871, 1856], for instance, tried to mimic desirable evolutionary properties of RNA 
folding [980]. They developed redundant genotype-phenotype mappings using voting (both 
via uniform redundancy and via a non-trivial approach), Turing machine-like binary 
instructions, cellular automata, and random Boolean networks [1099]. Except for the trivial voting 
mechanism based on uniform redundancy, the mappings induced neutral networks which 
proved beneficial for exploring the problem space. Especially the last approach provided par- 
ticularly good results [1871, 1856]. Possibly converse effects like epistasis (see Section 1.4.6) 
arising from the new genotype-phenotype mappings have not been considered in this study. 

Redundancy can have a strong impact on the explorability of the problem space. When 
utilizing a one-to-one mapping, the translation of a slightly modified genotype will always 
result in a different phenotype. If there exists a many-to-one mapping between genotypes 
and phenotypes, the search operations can create offspring genotypes different from the 
parent which still translate to the same phenotype. The optimizer may now walk along a 
path through this neutral network. If many genotypes along this path can be modified to 
different offspring, many new solution candidates can be reached [1871]. One example for 
beneficial redundancy is the extradimensional bypass idea discussed in Section 1.5.2. 

The experiments of Shipman et al. [1872, 1870] additionally indicate that neutrality 
in the genotype-phenotype mapping can have positive effects. In the Cartesian Genetic 
Programming method, neutrality is explicitly introduced in order to increase the evolvability 
(see Section 4.7.4 on page 201) [2110, 2297]. 

Yet, Rothlauf [1765] and Shackleton et al. [1856] show that simple uniform redundancy 
is not necessarily beneficial for the optimization process and may even slow it down. There 
is no use in introducing encodings which, for instance, represent each phenotypic bit with 
two bits in the genotype where 00 and 01 map to 0 and 10 and 11 map to 1. Another example 
for this issue is given in Fig. 1.31.b on page 86. 
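
The two-bits-per-bit encoding mentioned above can be written down directly. The sketch below (function names are our own) enumerates all genotypes of length four and checks that the mapping is not injective: the second bit of every pair never influences the phenotype, so it adds neutral moves but no new reachable phenotypes.

```python
from itertools import product


def gpm_uniform(genotype):
    """Each pair of genotype bits encodes one phenotype bit:
    00 and 01 map to 0, 10 and 11 map to 1."""
    return tuple(genotype[i] for i in range(0, len(genotype), 2))


def is_injective(gpm, genotypes):
    """True if no two distinct genotypes map to the same phenotype."""
    seen = set()
    for g in genotypes:
        x = gpm(g)
        if x in seen:
            return False
        seen.add(x)
    return True


genotypes = list(product((0, 1), repeat=4))
phenotypes = {gpm_uniform(g) for g in genotypes}
print(len(genotypes), len(phenotypes))
print(is_injective(gpm_uniform, genotypes))
```

Sixteen genotypes collapse onto four phenotypes, each represented equally often, which is what makes this redundancy uniform.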
Summary 

Unlike ruggedness, which is always bad for optimization algorithms, neutrality has 
aspects that may further as well as hinder the process of finding good solutions. Generally, 
we can state that degrees of neutrality v very close to 1 degenerate optimization processes 
to random walks. Some forms of neutral networks accompanied by low (nonzero) values of 
v can improve the evolvability and hence, increase the chance of finding good solutions. 

Adverse forms of neutrality are often caused by bad design of the search space or 
genotype-phenotype mapping. Uniform redundancy in the genome should be avoided where 
possible and the amount of neutrality in the search space should generally be limited. 

Needle-In-A-Haystack 

One of the worst cases of fitness landscapes is the needle-in-a-haystack (NIAH) problem 
sketched in Fig. 1.19.g on page 57, where the optimum occurs as an isolated spike in a plane. In 
other words, small instances of extreme ruggedness combine with a general lack of 
information in the fitness landscape. Such problems are extremely hard to solve and the optimization 
processes will often converge prematurely or take very long to find the global optimum. An 
example for such fitness landscapes is the all-or-nothing property often inherent to Genetic 
Programming of algorithms [2058], as discussed in Section 4.10.2 on page 223. 

1.4.6 Epistasis 
Introduction 

In biology, epistasis 60 is defined as a form of interaction between different genes [1640]. 
The term was coined by Bateson [157] and originally meant that one gene suppresses the 
phenotypical expression of another gene. In the context of statistical genetics, epistasis was 
initially called "epistacy" by Fisher [677]. According to Lush [1335], the interaction between 
genes is epistatic if the effect on the fitness of altering one gene depends on the allelic state of 
other genes. This understanding of epistasis comes very close to another biological expression: 
pleiotropy 61, which means that a single gene influences multiple phenotypic traits [2227]. In 
the area of global optimization, such fine-grained distinctions are usually not made and the 
two terms are often used more or less synonymously. 

Definition 1.53 (Epistasis). In optimization, epistasis is the dependency of the contribution 
of one gene to the value of the objective functions on the allelic state of other genes 
[491, 44, 1503]. 

We speak of minimal epistasis when every gene is independent of every other gene. Then, 
the optimization process equals finding the best value for each gene and can most efficiently 
be carried out by a simple greedy search (see Section 17.4.1) [491]. A problem is maximally 
epistatic when no proper subset of genes is independent of any other gene [1924, 1503]. 
Examples of problems with a high degree of epistasis are Kauffman's NK fitness landscape 
[1098, 1100] (Section 21.2.1), the p-Spin model [48] (Section 21.2.2), and the tunable model 
of Weise et al. [2185] (Section 21.2.7). 
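
The difference between minimal and strong epistasis can be sketched with a gene-by-gene greedy search. The code below is an illustration under our own assumptions (binary alleles, maximization, hypothetical weights, and a small trap function as the epistatic example); it is not the greedy search of Section 17.4.1.

```python
def greedy_per_gene(n, alleles, objective):
    """Fix the genes one after another, choosing for each gene the allele
    that maximizes the objective while all other genes stay unchanged."""
    genotype = [alleles[0]] * n
    for i in range(n):
        genotype[i] = max(
            alleles,
            key=lambda a: objective(genotype[:i] + [a] + genotype[i + 1:]),
        )
    return genotype


WEIGHTS = [3, -1, 2, -4]  # hypothetical gene contributions


def separable(g):
    """No epistasis: every gene contributes independently."""
    return sum(w * x for w, x in zip(WEIGHTS, g))


def trap4(g):
    """Strong epistasis: a gene only pays off if all others are 1 too."""
    ones = sum(g)
    return 4 if ones == 4 else 3 - ones


print(greedy_per_gene(4, (0, 1), separable))
print(greedy_per_gene(4, (0, 1), trap4))
```

On the separable function the greedy search finds the optimum [1, 0, 1, 0]; on the epistatic function it stays at the all-zero string although the all-one string is better.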

The Problem 

As sketched in Figure 1.22, epistasis has a strong influence on many of the previously dis- 
cussed problematic features. If one gene can "turn off" or affect the expression of other 



60 http://en.wikipedia.org/wiki/Epistasis [accessed 2008-05-31] 

61 http://en.wikipedia.org/wiki/Pleiotropy [accessed 2008-03-02] 
genes, a modification of this gene will lead to a large change in the features of the phenotype. 
Hence, the causality will be weakened and ruggedness ensues in the fitness landscape. 
It also becomes harder to define search operations with exploitive character. Moreover, sub- 
sequent changes to the "deactivated" genes may have no influence on the phenotype at all, 
which would then increase the degree of neutrality in the search space. Epistasis is mainly an 
aspect of the way in which the genome G and the genotype-phenotype mapping are defined. 
It should be avoided where possible. 




Figure 1.22: The influence of epistasis on the fitness landscape. 



Generally, epistasis and conflicting objectives in multi-objective optimization should be 
distinguished from each other. Epistasis as well as pleiotropy is a property of the influence 
of the editable elements (the genes) of the genotypes on the phenotypes. Objective functions 
can conflict without the involvement of any of these phenomena. We can, for example, 
define two objective functions $f_1(x) = x$ and $f_2(x) = -x$ which are clearly contradicting 
regardless of whether they both are subject to maximization or minimization. Nevertheless, 
if the solution candidates x and the genotypes are simple real numbers and the genotype- 
phenotype mapping is an identity mapping, neither epistatic nor pleiotropic effects can 
occur. 

Naudts and Verschoren [1504] have shown for the special case of length-two binary string 
genomes that deceptiveness does not occur in situations with low epistasis and also that 
objective functions with high epistasis are not necessarily deceptive. Another discussion 
about different shapes of fitness landscapes under the influence of epistasis is given by 
Beerenwinkel et al. [167]. 

Countermeasures 

General 

We have shown that epistasis is a root cause for multiple problematic features of optimiza- 
tion tasks. General countermeasures against epistasis can be divided into two groups. The 
symptoms of epistasis can be mitigated with the same methods which increase the chance of 
finding good solutions in the presence of ruggedness or neutrality - using larger populations 
and favoring explorative search operations. Epistasis itself is a feature which results from 
the choice of the search space structure, the search operations, and the genotype-phenotype 
mapping. Avoiding epistatic effects should be a major concern during their design. This can 
lead to a great improvement in the quality of the solutions produced by the optimization 
process [2181]. Some general rules for search space design are outlined in Section 1.5.2. 
Linkage Learning 

According to Winter et al. [2242], linkage is "the tendency for alleles of different genes to 
be passed together from one generation to the next" in genetics. This usually indicates 
that these genes are closely located in the same chromosome. In the context of evolutionary 
algorithms, this notion is not useful since identifying spatially close elements inside the 
genotypes $g \in \mathbb{G}$ is trivial. Instead, we are interested in alleles of different genes which have 
a joint effect on the fitness [1486, 1485]. 

Identifying these linked genes, i.e., learning their epistatic interaction, is very helpful for 
the optimization process. Such knowledge can be used to protect building blocks 62 from being 
destroyed by the search operations (such as crossover in genetic algorithms), for instance. 
Finding approaches for linkage learning has become an especially popular discipline in the 
area of evolutionary algorithms with binary [896, 1486, 1647] and real [546] genomes. Two 
important methods from this area are the messy GA (mGA, see Section 3.7) by Goldberg 
et al. [825] and the Bayesian Optimization Algorithm (BOA) [1633, 333]. Module acquisition 
[66] may be considered as such an effort. 
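
A very simple way to probe whether two genes are linked is a perturbation check: if the effect of changing gene i depends on the state of gene j, the two genes interact. The sketch below is only an illustration of this idea under our own assumptions (bit strings, a hypothetical objective, a check at one single genotype); it is neither the messy GA nor the Bayesian Optimization Algorithm.

```python
def flip(genotype, k):
    """Copy of genotype with gene k inverted."""
    copy = genotype[:]
    copy[k] = 1 - copy[k]
    return copy


def interacts(objective, genotype, i, j):
    """True if the fitness effect of flipping gene i changes when
    gene j is flipped as well (checked at this one genotype only)."""
    effect_alone = objective(flip(genotype, i)) - objective(genotype)
    with_j = flip(genotype, j)
    effect_with_j = objective(flip(with_j, i)) - objective(with_j)
    return effect_alone != effect_with_j


def objective(g):
    """Hypothetical objective: genes 0 and 1 only pay off together,
    gene 2 contributes independently."""
    return g[0] * g[1] + g[2]


print(interacts(objective, [0, 0, 0], 0, 1))
print(interacts(objective, [0, 0, 0], 0, 2))
```

A negative result at a single genotype does not prove independence, so a practical linkage learner would repeat such checks over many sampled genotypes.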

1.4.7 Noise and Robustness 
Introduction: Noise 

In the context of optimization, three types of noise can be distinguished. The first form is 
noise in the training data used as basis for learning (i). In many applications of machine 
learning or optimization where a model for a given system is to be learned, data samples 
including the input of the system and its measured response are used for training. Some 
typical examples of situations where training data is the basis for the objective function 
evaluation are 

1. the usage of global optimization for building classifiers (for example for predicting buying 
behavior using data gathered in a customer survey for training), 

2. the usage of simulations for determining the objective values in Genetic Programming 
(here, the simulated scenarios correspond to training cases), and 

3. the fitting of mathematical functions to (x, y)-data samples (with artificial neural net- 
works or symbolic regression, for instance). 

Since no measurement device is 100% accurate and there are always random errors, noise is 
present in such optimization problems. 

Besides inexactness and fluctuations in the input data of the optimization process, 
perturbations are also likely to occur during the application of its results. This category 
subsumes the other two types of noise: perturbations that may arise from (ii) inaccuracies 
in the process of realizing the solutions and (iii) environmentally induced perturbations 
during the applications of the products. 

This issue can be illustrated by using the process of developing the perfect tire for a car 
as an example. As input for the optimizer, all sorts of material coefficients and geometric 
constants measured from all known types of wheels and rubber could be available. Since 
these constants have been measured or calculated from measurements, they include a certain 
degree of noise and imprecision (i). 

The result of the optimization process will be the best tire construction plan discovered 
during its course and it will likely incorporate different materials and structures. We would 
hope that the tires created according to the plan will not fall apart if, accidentally, an extra 
0.0001% of a specific rubber component is used (ii). During the optimization process, the 
behavior of many construction plans will be simulated in order to find out about their 
utility. When actually manufactured, the tires should not behave unexpectedly when used 



62 See Section 3.6.5 for information on the Building Block Hypothesis. 




in scenarios different from those simulated (iii) and should instead be applicable in all driving 
situations likely to occur. 

The effects of noise in optimization have been studied by various researchers; Miller 
and Goldberg [1416, 1415], Lee and Wong [1268], and Gurin and Rastrigin [870] are some 
of them. Many global optimization algorithms and theoretical results have been proposed 
which can deal with noise. Some of them are, for instance, specialized 

1. genetic algorithms [685, 2062, 2060, 1799, 1800, 1146], 

2. Evolution Strategies [195, 100, 881], and 

3. Particle Swarm Optimization [1606, 884] approaches. 

The Problem: Need for Robustness 

The goal of global optimization is to find the global optima of the objective functions. While 
this is fully true from a theoretical point of view, it may not suffice in practice. Optimization 
problems are normally used to find good parameters or designs for components or plans to 
be put into action by human beings or machines. As we have already pointed out, there will 
always be noise and perturbations in practical realizations of the results of optimization. 
There is no process in the world that is 100% accurate and the optimized parameters, 
designs, and plans have to tolerate a certain degree of imprecision. 

Definition 1.54 (Robustness). A system in engineering or biology is robust if it is able to 
function properly in the face of genetic or environmental perturbations [2132]. 

Therefore, a local optimum (or even a non-optimal element) for which slight disturbances 
only lead to gentle performance degenerations is usually favored over a global optimum lo- 
cated in a highly rugged area of the fitness landscape [276]. In other words, local optima in 
regions of the fitness landscape with strong causality are sometimes better than global op- 
tima with weak causality. Of course, the level of this acceptability is application-dependent. 
Figure 1.23 illustrates the issue of local optima which are robust vs. global optima which 
are not. More examples from the real world are: 

1. When optimizing the control parameters of an airplane or a nuclear power plant, the 
global optimum is certainly not used if a slight perturbation can have hazardous effects 
on the system [2062]. 

2. Wiesmann et al. [2218, 2217] bring up the topic of manufacturing tolerances in multilayer 
optical coatings. It is no use to find optimal configurations if they only perform optimally 
when manufactured to a precision which is either impossible or too hard to achieve on 
a constant basis. 

3. The optimization of the decision process on which roads should be salted as a precaution 
for areas with marginal winter climate is an example of the need for dynamic robustness. 
The global optimum of this problem is likely to depend on the daily (or even current) 
weather forecast and may therefore be constantly changing. Handa et al. [886] point 
out that it is practically infeasible to let road workers follow a constantly changing plan 
and circumvent this problem by incorporating multiple road temperature settings in the 
objective function evaluation. 

4. Tsutsui et al. [2062, 2060] found a nice analogy in nature: The phenotypic characteristics 
of an individual are described by its genetic code. During the interpretation of this code, 
perturbations like abnormal temperature, nutritional imbalances, injuries, illnesses and 
so on may occur. If the phenotypic features emerging under these influences have low fit- 
ness, the organism cannot survive and procreate. Thus, even a species with good genetic 
material will die out if its phenotypic features become too sensitive to perturbations. 
Species robust against them, on the other hand, will survive and evolve. 




Countermeasures 

For the special case where the phenome is a real vector space ($X \subseteq \mathbb{R}^n$), several approaches 
for dealing with the need for robustness have been developed. Inspired by Taguchi 
methods 63 [1995], possible disturbances are represented by a vector 
$\delta = (\delta_1, \delta_2, \ldots, \delta_n)^T$, $\delta_i \in \mathbb{R}$, 
in the method suggested by Greiner [859, 860]. If the distributions and influences of 
the $\delta_i$ are known, the objective function $f(x) : x \in X$ can be rewritten as $f(x, \delta)$ 
[2218]. In the special case where $\delta$ is normally distributed, this can be simplified to 
$f\left((x_1 + \delta_1, x_2 + \delta_2, \ldots, x_n + \delta_n)^T\right)$. It would then make sense to sample the 
probability distribution of $\delta$ a number of $t$ times and to use the mean values of $f(x, \delta)$ for 
each objective function evaluation during the optimization process. In cases where the optimal 
value $y^*$ of the objective function $f$ is known, Equation 1.46 can be minimized. This approach 
is also used in the work of Wiesmann et al. [2217, 2218] and basically turns the optimization 
algorithm into something like a maximum likelihood estimator (see Section 28.7.2 and 
Equation 28.252 on page 502). 

$$ f'(x) = \frac{1}{t} \sum_{i=1}^{t} \left( y^* - f(x, \delta_i) \right)^2 \tag{1.46} $$

This method corresponds to using multiple, different training scenarios during the objective 
function evaluation in situations where $X \not\subseteq \mathbb{R}^n$. By adding random noise and artificial 
perturbations to the training cases, the chance of obtaining robust solutions which are stable 
when applied or realized under noisy conditions can be increased. 
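
The sampling idea behind Equation 1.46 can be sketched as follows. The objective f with one narrow and one broad peak, the noise level sigma, the sample count t, and the seed are all hypothetical choices made for this illustration; disturbances are assumed to be normally distributed and simply added to x.

```python
import math
import random


def f(x):
    """Hypothetical 1-D objective (maximization): a narrow global peak
    at x = 0 and a broad, slightly lower peak at x = 3."""
    narrow = math.exp(-((x[0] / 0.01) ** 2))
    broad = 0.9 * math.exp(-((x[0] - 3.0) ** 2))
    return max(narrow, broad)


def robust_error(objective, x, y_star, sigma, t, rng):
    """Mean squared deviation from the known optimal value y_star over
    t normally disturbed copies of x (to be minimized)."""
    total = 0.0
    for _ in range(t):
        disturbed = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += (y_star - objective(disturbed)) ** 2
    return total / t


rng = random.Random(42)
err_narrow = robust_error(f, [0.0], 1.0, 0.1, 1000, rng)
err_broad = robust_error(f, [3.0], 1.0, 0.1, 1000, rng)
print(f([0.0]), f([3.0]))
print(err_narrow, err_broad)
```

Without disturbances the narrow peak is the better solution, but under the averaged measure the broad peak receives the smaller error and would be preferred by an optimizer minimizing it.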

1.4.8 Overfitting and Oversimplification 

In all scenarios where optimizers evaluate some of the objective values of the solution can- 
didates by using training data, two additional phenomena with negative influence can be 
observed: overfitting and oversimplification. 

Overfitting 

The Problem 

Definition 1.55 (Overfitting). Overfitting 64 is the emergence of an overly complicated 
model (solution candidate) in an optimization process resulting from the effort to provide 
the best results for as much of the available training data as possible [1805, 1905, 785, 564]. 

63 http://en.wikipedia.org/wiki/Taguchi_methods [accessed 2008-07-19] 

64 http://en.wikipedia.org/wiki/Overfitting [accessed 2007-07-03] 
A model (solution candidate) $m \in X$ optimized based on a finite set of training data 
is considered to be overfitted if a less complicated, alternative model $m' \in X$ exists which 
has a smaller error for the set of all possible (maybe even infinitely many), available, or 
(theoretically) producible data samples. This model $m'$ may, however, have a larger error in 
the training data. 

The phenomenon of overfitting is best known and can often be encountered in the field 
of artificial neural networks or in curve fitting 65 [2019, 1291, 1265, 1806, 1761]. The latter 
means that we have a set $A$ of $n$ training data samples $(x_i, y_i)$ and want to find a function 
$f$ that represents these samples as well as possible, i. e., $f(x_i) = y_i \; \forall (x_i, y_i) \in A$. 

There exists exactly one polynomial 66 of degree at most $n - 1$ that fits to each such training 
data and goes through all its points. 67 Hence, when only polynomial regression is performed, 
there is exactly one perfectly fitting function of minimal degree. Nevertheless, there will also 
be an infinite number of polynomials with a higher degree than $n - 1$ that also match the 
sample data perfectly. Such results would be considered as overfitted. 

In Figure 1.24, we have sketched this problem. The function $f_1(x) = x$ shown in 
Fig. 1.24.b has been sampled three times, as sketched in Fig. 1.24.a. There exists no other 
polynomial of a degree of two or less that fits to these samples than $f_1$. Optimizers, however, 
could also find overfitted polynomials of a higher degree such as $f_2$ which also match the 
data, as shown in Fig. 1.24.c. Here, $f_2$ plays the role of the overly complicated model $m$ 
which will perform as well as the simpler model $m'$ when tested with the training sets only, 
but will fail to deliver good results for all other input data. 
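
This situation is easy to reproduce numerically. In the sketch below, the three sample positions and the cubic f2 are our own hypothetical choices (the actual curve in Fig. 1.24.c is not specified): f2 equals f1 plus a term that vanishes at every sample point, so both models have zero training error.

```python
def f1(x):
    """The simple model m': a straight line."""
    return x


def f2(x):
    """An overly complicated model m: f1 plus a cubic term that is zero
    at each of the three sample positions 0, 1, and 2."""
    return x + 5.0 * (x - 0.0) * (x - 1.0) * (x - 2.0)


samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # three samples of f1


def squared_error(model, data):
    """Sum of squared deviations of the model from the data samples."""
    return sum((model(x) - y) ** 2 for x, y in data)


print(squared_error(f1, samples), squared_error(f2, samples))
print(f1(3.0), f2(3.0))
```

An objective function that only measures the training error cannot distinguish the two models, although they disagree strongly at the unseen input x = 3.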




Figure 1.24: Overfitting due to complexity. (Fig. 1.24.a: Three sample points of $f_1$; 
Fig. 1.24.b: $m' = f_1(x) = x$; Fig. 1.24.c: $m = f_2(x)$.) 



A very common cause for overfitting is noise in the sample data. As we have already 
pointed out, there exists no measurement device for physical processes which delivers per- 
fect results without error. Surveys that represent the opinions of people on a certain topic 
or randomized simulations will exhibit variations from the true interdependencies of the ob- 
served entities, too. Hence, data samples based on measurements will always contain some 
noise. 

In Figure 1.25 we have sketched how such noise may lead to overfitted results. Fig. 1.25.a 
illustrates a simple physical process obeying some quadratic equation. This process has been 
measured using some technical equipment and the 100 noisy samples depicted in Fig. 1.25.b 
have been obtained. Fig. 1.25.c shows a function resulting from an optimization that fits 
the data perfectly. It could, for instance, be a polynomial of degree 99 that goes right 
through all the points and thus, has an error of zero. Although being a perfect match to the 

65 We will discuss overfitting in conjunction with Genetic Programming-based symbolic regression 
in Section 23.1 on page 397. 

66 http://en.wikipedia.org/wiki/Polynomial [accessed 2007-07-03] 

67 http://en.wikipedia.org/wiki/Polynomial_interpolation [accessed 2008-03-01] 
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measurements, this complicated model does not accurately represent the physical law that 
produced the sample data and will not deliver precise results for new, different inputs. 




Figure 1.25: Fitting noise. Panels: (a) the original physical process; (b) the
measurement/training data; (c) the overfitted result.



From the examples we can see that the major problem that results from overfitted solu- 
tions is the loss of generality. 

Definition 1.56 (Generality). A solution of an optimization process is general if it is
not only valid for the sample inputs $a_1, a_2, \dots, a_n$ which were used for training during the
optimization process, but also for different inputs $a \neq a_i \; \forall i : 0 < i \leq n$ if such inputs $a$
exist.

Countermeasures 

There exist multiple techniques that can be utilized in order to prevent overfitting to a 
certain degree. It is most efficient to apply multiple such techniques together in order to 
achieve best results. 

A very simple approach is to restrict the problem space X in a way that only solutions up 
to a given maximum complexity can be found. In terms of function fitting, this could mean 
limiting the maximum degree of the polynomials to be tested. Furthermore, the functional 
objective functions which solely concentrate on the error of the solution candidates should 
be augmented by penalty terms and non-functional objective functions putting pressure in 
the direction of small and simple models [564, 1108]. 
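
As a minimal sketch of such a penalty term, the following Python fragment adds a
degree-dependent cost to the squared error of a polynomial. The coefficient-list encoding, the
penalty weight of 0.1, and the function names are assumptions chosen for illustration; the
cited works may define their penalties differently.

```python
# Sketch: augmenting the error with a complexity penalty.
# The penalty weight and the coefficient-list encoding are assumptions.

def evaluate(coeffs, x):
    """Evaluate a polynomial given as [c0, c1, c2, ...] using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def degree(coeffs, eps=1e-12):
    """Index of the highest coefficient that is not (almost) zero."""
    for i in range(len(coeffs) - 1, -1, -1):
        if abs(coeffs[i]) > eps:
            return i
    return 0

def penalized_objective(coeffs, samples, weight=0.1):
    """Squared error plus a term that grows with the polynomial degree."""
    error = sum((evaluate(coeffs, x) - y) ** 2 for x, y in samples)
    return error + weight * degree(coeffs)

samples = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
simple = [0.0, 1.0]                   # x
complicated = [0.0, -4.0, 0.0, 5.0]   # 5x^3 - 4x, also fits all samples
print(penalized_objective(simple, samples))
print(penalized_objective(complicated, samples))
```

Both polynomials have zero error on the samples, so only the penalty separates them and the
simpler model receives the better (smaller) objective value.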

Large sets of sample data, although slowing down the optimization process, may improve 
the generalization capabilities of the derived solutions. If arbitrarily many training datasets 
or training scenarios can be generated, there are two approaches which work against over- 
fitting: 

1. The first method is to use a new set of (randomized) scenarios for each evaluation of 
each solution candidate. The resulting objective values then may differ largely even if 
the same individual is evaluated twice in a row, introducing incoherence and ruggedness 
into the fitness landscape. 

2. At the beginning of each iteration of the optimizer, a new set of (randomized) scenarios 
is generated which is used for all individual evaluations during that iteration. This 
method leads to objective values which can be compared without bias. They can be 
made even more comparable if the objective functions are always normalized into some 
fixed interval, say [0, 1]. 

In both cases it is helpful to use more than one training sample or scenario per evaluation 
and to set the resulting objective value to the average (or better median) of the outcomes. 
Otherwise, the fluctuations of the objective values between the iterations will be very large,
making it hard for the optimizers to follow a stable gradient for multiple steps. 
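
The second approach, combined with the median over several scenarios, might look as follows.
This is a hedged Python sketch: the target relation y = 2x, the noise level, the scenario
count, and the toy population of candidate slopes are all assumptions made for illustration.

```python
# Sketch: one fresh set of random scenarios per iteration, shared by all
# individuals, with the median of the outcomes as objective value.
import random
import statistics

def make_scenarios(rng, count=15):
    """Draw a new set of training inputs (assumed here to be numbers in [-1, 1])."""
    return [rng.uniform(-1.0, 1.0) for _ in range(count)]

def objective(candidate, scenarios, rng, noise=0.05):
    """Median squared error of a candidate slope against a noisy target y = 2x."""
    errors = []
    for x in scenarios:
        measured = 2.0 * x + rng.gauss(0.0, noise)
        errors.append((candidate * x - measured) ** 2)
    return statistics.median(errors)

rng = random.Random(42)
population = [0.5, 1.5, 2.0, 3.0]
for iteration in range(3):
    scenarios = make_scenarios(rng)          # regenerated once per iteration
    scores = [objective(c, scenarios, rng) for c in population]
    best = population[scores.index(min(scores))]
    print(iteration, best)
```

Because all individuals of one iteration see the same scenarios, differences between their
objective values stem from the individuals and the measurement noise only, not from differently
difficult inputs.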

Another simple method to prevent overfitting is to limit the runtime of the optimizers 
[1805]. It is commonly assumed that learning processes normally first find relatively general 
solutions which subsequently begin to overfit because the noise "is learned" , too. 

For the same reason, some algorithms allow the rate at which the solution candidates are
modified to be decreased over time. Such a decay of the learning rate makes overfitting less
likely.

Dividing Data into Training and Test Sets

If only one finite set of data samples is available for training/optimization, it is common
practice to separate it into a set of training data $A_t$ and a set of test cases $A_c$. During
the optimization process, only the training data is used. The resulting solutions are tested
with the test cases afterwards. If their behavior is significantly worse when applied to $A_c$
than when applied to $A_t$, they are probably overfitted.

The same approach can be used to detect when the optimization process should be 
stopped. The best known solution candidates can be checked with the test cases in each 
iteration without influencing their objective values which solely depend on the training data. 
If their performance on the test cases begins to decrease, there are no benefits in letting the 
optimization process continue any further. 
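
The following Python sketch illustrates this stopping criterion with a simple hill climber
that fits a slope. The data-generating relation, the 25% test fraction, and the patience of
30 iterations are assumptions for illustration, not recommendations from the text.

```python
# Sketch: hold-out test cases used only to decide when to stop.
import random

def split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def mse(slope, data):
    """Mean squared error of the model y = slope * x."""
    return sum((slope * x - y) ** 2 for x, y in data) / len(data)

rng = random.Random(7)
samples = [(x / 10.0, 3.0 * x / 10.0 + rng.gauss(0.0, 0.2)) for x in range(-20, 21)]
train, test = split(samples, 0.25, rng)

slope, step = 0.0, 0.5
best_test, stalled = mse(slope, test), 0
for iteration in range(200):
    candidate = slope + rng.uniform(-step, step)
    if mse(candidate, train) < mse(slope, train):   # only training data steers the search
        slope = candidate
    current_test = mse(slope, test)                 # test data is only observed
    if current_test < best_test:
        best_test, stalled = current_test, 0
    else:
        stalled += 1
    if stalled >= 30:                               # patience is an assumed parameter
        break
print(iteration, slope)
```

The test error never influences which candidate is accepted. It is consulted only to decide
whether continuing the run is still worthwhile.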

Oversimplification 

The Problem 

Oversimplification (also called overgeneralization) is the opposite of overfitting. Whereas 
overfitting denotes the emergence of overly complicated solution candidates, oversimplified 
solutions are not complicated enough. Although they represent the training samples used 
during the optimization process seemingly well, they are rough overgeneralizations which 
fail to provide good results for cases not part of the training. 

A common cause for oversimplification is sketched in Figure 1.26: The training sets 
only represent a fraction of the set of possible inputs. As this is normally the case, one 
should always be aware that such an incomplete coverage may fail to represent some of the 
dependencies and characteristics of the data, which then may lead to oversimplified solutions. 
Another possible reason for oversimplification is that ruggedness, deceptiveness, too much 
neutrality, or high epistasis in the fitness landscape may lead to premature convergence and 
prevent the optimizer from surpassing a certain quality of the solution candidates. It then 
cannot adapt them completely even if the training data perfectly represents the sampled 
process. A third possible cause is that a problem space could have been chosen which does 
not include the correct solution. 

Fig. 1.26.a shows a cubic function. Since it is a polynomial of degree three, four sample
points are needed for its unique identification. Maybe not knowing this, only three samples
have been provided in Fig. 1.26.b. By doing so, some vital characteristics of the function
are lost. Fig. 1.26.c depicts a quadratic function, the polynomial of the lowest degree that
fits these samples exactly. Although it is a perfect match, this function does not touch any
other point on the original cubic curve and behaves totally differently in the lower parameter
area.
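
This effect is easy to reproduce. In the Python sketch below, the cubic x^3 - x and the three
sample positions are assumptions chosen for illustration; they are not the function shown in
the figure. Lagrange interpolation yields the lowest-degree polynomial through the samples,
which here is a quadratic.

```python
# Sketch: three samples of a cubic are matched exactly by a quadratic.
# The cubic and the sample positions are illustrative assumptions.

def cubic(x):
    return x ** 3 - x

def lagrange(points):
    """Return the interpolating polynomial of lowest degree through the points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

train = [(x, cubic(x)) for x in (0.0, 1.0, 2.0)]
quadratic = lagrange(train)

for x, y in train:
    print(x, y, quadratic(x))          # identical on the training samples
print(cubic(-2.0), quadratic(-2.0))    # very different outside of them
```

For these samples the difference between the two functions is x(x - 1)(x - 2), which is zero
only at the three training points.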

However, even if we had included point P in our training data, it would still be possible 
that the optimization process would yield Fig. 1.26.C as a result. Having training data that 
correctly represents the sampled system does not mean that the optimizer is able to find a 
correct solution with perfect fitness - the other, previously discussed problematic phenomena 
can prevent it from doing so. Furthermore, if it was not known that the system which was 
to be modeled by the optimization process can best be represented by a polynomial of the 
third degree, one could have limited the problem space X to polynomials of degree two and 
less. Then, the result would likely again be something like Fig. 1.26.C, regardless of how 
many training samples are used. 
Figure 1.26: Oversimplification. Panels: (a) the "real system" and the points describing it;
(b) the sampled training data; (c) the oversimplified result.



Countermeasures 

In order to counter oversimplification, its causes have to be mitigated. Generally, it is not 
possible to have training scenarios which cover the complete input space of the evolved 
programs. By using multiple scenarios for each individual evaluation, the chance of missing 
important aspects is decreased. These scenarios can be replaced with new, randomly created 
ones in each generation, in order to decrease this chance even more. The problem space, i.e., 
the representation of the solution candidates, should further be chosen in a way which 
allows constructing a correct solution to the problem defined. Then again, releasing too 
many constraints on the solution structure increases the risk of overfitting and thus, careful 
proceeding is recommended. 

1.4.9 Dynamically Changing Fitness Landscape 

It should also be mentioned that there exist problems with dynamically changing fitness 
landscapes [282, 1465, 1729, 277, 278]. The task of an optimization algorithm is then to 
provide solution candidates with momentarily optimal objective values for each point in 
time. Here we have the problem that an optimum in iteration t will possibly not be an 
optimum in iteration t + 1 anymore. 
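
A minimal Python sketch of such a landscape is given below. The quadratic shape, the drift
speed, and the fixed candidate set are assumptions for illustration only.

```python
# Sketch: an objective function whose optimum moves with the iteration t.
# The drift speed and the quadratic shape are assumptions for illustration.

def objective(x, t, speed=0.5):
    """Value to be minimized; the minimum sits at x = speed * t."""
    return (x - speed * t) ** 2

candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
for t in range(4):
    best = min(candidates, key=lambda x: objective(x, t))
    print(t, best)
```

The best candidate changes from one iteration to the next, so an optimizer has to keep
re-evaluating its solutions instead of relying on objective values computed earlier.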

Problems with dynamic characteristics can, for example, be tackled with special forms 
[2280] of 

1. evolutionary algorithms [2053, 2224, 279, 280, 1463, 1464, 82], 

2. genetic algorithms [817, 1457, 1458, 1459, 1146], 

3. Particle Swarm Optimization [343, 344, 1280, 1605, 211], 

4. Differential Evolution [1391, 2266], and 

5. Ant Colony Optimization [868, 869] 

The moving peaks benchmarks by Branke [277, 278] and Morrison and De Jong [1465] 
are good examples for dynamically changing fitness landscapes. You can find them discussed 
in Section 21.1.3 on page 328. 

1.4.10 The No Free Lunch Theorem 



By now, we know the most important problems that can be encountered when applying
an optimization algorithm to a given problem. Furthermore, we have seen that it is arguable
what an optimum actually is if multiple criteria are optimized at once. The fact that there
is most likely no optimization method that can outperform all others on all problems can,
thus, easily be accepted. Instead, there exists a variety of optimization methods specialized
in solving different types of problems. There are also algorithms which deliver good results
for many different problem classes, but may be outperformed by highly specialized methods
in each of them. These facts have been formalized by Wolpert and Macready [2244, 2245]
in their No Free Lunch Theorems 68 (NFL) for search and optimization algorithms.

Initial Definitions 

Wolpert and Macready [2245] consider single-objective optimization and define an optimization
problem $\phi(g) = f(\mathrm{gpm}(g))$ as a mapping of a search space $G$ to the objective space $Y$. 69
Since this definition subsumes the problem space and the genotype-phenotype mapping, only
skipping the possible search operations, it is very similar to our Definition 1.34 on page 46.
They further call a time-ordered set $d_m$ of $m$ distinct visited points in $G \times Y$ a "sample" of
size $m$ and write $d_m = \{(d_m^g(1), d_m^y(1)), (d_m^g(2), d_m^y(2)), \dots, (d_m^g(m), d_m^y(m))\}$.
$d_m^g(i)$ is the genotype and $d_m^y(i)$ the corresponding objective value visited at time step $i$.
Then, the set $D_m = (G \times Y)^m$ is the space of all possible samples of length $m$ and
$D = \bigcup_{m \geq 0} D_m$ is the set of all samples of arbitrary size.

An optimization algorithm $a$ can now be considered to be a mapping of the previously
visited points in the search space (i.e., a sample) to the next point to be visited. Formally,
this means $a : D \mapsto G$. Without loss of generality, Wolpert and Macready [2245] only regard
unique visits and thus define $a : d \in D \mapsto g : g \notin d$.

Performance measures $\Psi$ can be defined independently of the optimization algorithms,
based only on the values of the objective function visited in the samples $d_m$. If the objective
function is subject to minimization, $\Psi(d_m^y) = \min\{d_m^y(i) : i = 1..m\}$ would be the appropriate
measure.

Often, only parts of the optimization problem $\phi$ are known. If the minima of the objective
function $f$ were already identified beforehand, for instance, its optimization would be useless.
Since the behavior in wide areas of $\phi$ is not obvious, it makes sense to define a probability
$P(\phi)$ that we are actually dealing with $\phi$ and no other problem. Wolpert and Macready
[2245] use the handy example of the travelling salesman problem in order to illustrate this
issue. Each distinct TSP produces a different structure of $\phi$. Yet, we would use the same
optimization algorithm $a$ for all problems of this class without knowing the exact shape of $\phi$.
This corresponds to the assumption that there is a set of very similar optimization problems
which we may encounter here although their exact structure is not known. We act as if there
was a probability distribution over all possible problems which is non-zero for the TSP-alike
ones and zero for all others.

The Theorem 

The performance of an algorithm $a$ iterated $m$ times on an optimization problem $\phi$ can
then be defined as $P(d_m^y \mid \phi, m, a)$, i.e., the conditional probability of finding a particular
sample $d_m^y$. Notice that this measure is very similar to the value of the problem landscape
$\Phi(x, \tau)$ introduced in Definition 1.38 on page 48, which is the cumulative probability that
the optimizer has visited the element $x \in X$ until (inclusively) the $\tau$-th evaluation of the
objective function(s).

Wolpert and Macready [2245] prove that the sum of such probabilities over all possible
optimization problems $\phi$ is always identical for all optimization algorithms. For two
optimizers $a_1$ and $a_2$, this means that

$$\sum_{\phi} P(d_m^y \mid \phi, m, a_1) = \sum_{\phi} P(d_m^y \mid \phi, m, a_2) \tag{1.47}$$

Hence, the average over all $\phi$ of $P(d_m^y \mid \phi, m, a)$ is independent of $a$.

68 http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization [accessed 2008-03-28]

69 Notice that we have partly utilized our own notations here in order to be consistent throughout
the book.

Implications 

From this theorem, it immediately follows that, in order to outperform $a_1$ in one
optimization problem, $a_2$ will necessarily perform worse in another. Figure 1.27 visualizes this
issue. It shows that general optimization approaches like evolutionary algorithms can solve
a variety of problem classes with reasonable performance. In this figure, we have chosen
a performance measure $\Phi$ subject to maximization, i.e., the higher its values, the faster
the problem will be solved. Hill climbing approaches, for instance, will be much faster than
evolutionary algorithms if the objective functions are steady and monotonic, that is, in a
smaller set of optimization tasks. Greedy search methods will perform fast on all problems
with matroid 70 structure. Evolutionary algorithms will most often still be able to solve these
problems; it just takes them longer to do so. The performance of hill climbing and greedy
approaches degenerates in other classes of optimization tasks as a trade-off for their high
utility in their "area of expertise".




Figure 1.27: A visualization of the No Free Lunch Theorem. Legend: random walk or exhaustive
enumeration; general optimization algorithm (an EA, for instance); specialized optimization
algorithm 1 (a hill climber, for instance); specialized optimization algorithm 2 (a depth-first
search, for instance).

One interpretation of the No Free Lunch Theorem is that it is impossible for any opti- 
mization algorithm to outperform random walks or exhaustive enumerations on all possible 
problems. For every problem where a given method leads to good results, we can construct 
a problem where the same method has exactly the opposite effect (see Section 1.4.4). As 
a matter of fact, doing so is even a common practice to find weaknesses of optimization 
algorithms and to compare them with each other, see Section 21.2.6, for example. 
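
For very small spaces, Equation 1.47 can be checked by brute force. The Python sketch below is
a toy under stated assumptions: a search space of three points, a binary objective space, and
two hand-written deterministic, non-revisiting algorithms a1 and a2 that are not taken from
the text. Since both algorithms are deterministic, the probability of a value sequence is
either 0 or 1 for a given problem, so the sum over all problems becomes a simple count.

```python
# Sketch: brute-force check of Equation 1.47 on a tiny search space.
from collections import Counter
from itertools import product

G = [0, 1, 2]          # search space (assumed tiny so all problems can be listed)
Y = [0, 1]             # objective space

def a1(sample):
    """Visit the points in ascending order."""
    visited = {g for g, _ in sample}
    return min(g for g in G if g not in visited)

def a2(sample):
    """Start at 2, then let the last observed value decide where to go."""
    visited = {g for g, _ in sample}
    if not sample:
        return 2
    remaining = [g for g in G if g not in visited]
    return remaining[0] if sample[-1][1] == 0 else remaining[-1]

def run(algorithm, phi, m):
    """Return the sequence of objective values seen in m steps."""
    sample = []
    for _ in range(m):
        g = algorithm(sample)
        sample.append((g, phi[g]))
    return tuple(y for _, y in sample)

def histogram(algorithm, m):
    """Count how often each value sequence occurs over all problems phi: G -> Y."""
    return Counter(run(algorithm, dict(zip(G, values)), m)
                   for values in product(Y, repeat=len(G)))

for m in (1, 2, 3):
    print(m, histogram(a1, m) == histogram(a2, m))
```

Any performance measure that depends only on the observed objective values therefore has the
same distribution for both algorithms when all eight problems are considered together.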



70 http://en.wikipedia.org/wiki/Matroid [accessed 2008-03-28]

Another interpretation is that every useful optimization algorithm utilizes some form
of problem-specific knowledge. Radcliffe [1696] states that without such knowledge, search
algorithms cannot exceed the performance of simple enumerations. Incorporating knowledge
starts with relying on simple assumptions like "if x is a good solution candidate, then we
can expect other good solution candidates in its vicinity", i.e., strong causality. The more
(correct) problem-specific knowledge is integrated (correctly) into the algorithm structure,
the better the algorithm will perform. On the other hand, knowledge correct for one class
of problems is, quite possibly, misleading for another class. In reality, we use optimizers to
solve a given set of problems and are not interested in their performance when (wrongly)
applied to other classes.

The rough meaning of the NFL is that all black-box optimization methods perform
equally well over the complete set of all optimization problems [1563]. In practice, we do not
want to apply an optimizer to all possible problems but to only some, restricted classes. In 
terms of these classes, we can make statements about which optimizer performs better. 

Today, there exists a wide range of work on No Free Lunch Theorems for many different
aspects of machine learning. The website http://www.no-free-lunch.org/ 71 gives a good
overview of them. Further summaries, extensions, and criticisms have been provided by
Köppen et al. [1173], Droste et al. [602, 601, 599, 600], Oltean [1563], and Igel and Toussaint
[1008, 1009]. Radcliffe and Surry [1694] discuss the NFL in the context of evolutionary
algorithms and the representations used as search spaces. The No Free Lunch Theorem is
furthermore closely related to the Ugly Duckling Theorem 72 proposed by Watanabe [2159]
for classification and pattern recognition.

1.4.11 Conclusions 

The subject of this introductory chapter was the question of what makes optimization
problems hard, especially for metaheuristic approaches. We have discussed numerous different
phenomena which can affect the optimization process and lead to disappointing results.

If an optimization process has converged prematurely, it has been trapped in a non- 
optimal region of the search space from which it cannot "escape" anymore (Section 1.4.2). 
Ruggedness (Section 1.4.3) and deceptiveness (Section 1.4.4) in the fitness landscape,
often caused by epistatic effects (Section 1.4.6), can misguide the search into such a region.
Neutrality and redundancy (Section 1.4.5) can either slow down optimization because the 
application of the search operations does not lead to a gain in information or may also con- 
tribute positively by creating neutral networks from which the search space can be explored 
and local optima can be escaped from. Noise is present in virtually all practical optimization 
problems. The solutions that are derived for them should be robust (Section 1.4.7). Also, 
they should neither be too general (oversimplification, Section 1.4.8) nor too specifically 
aligned only to the training data (overfitting, Section 1.4.8). Furthermore, many practical 
problems are multi-objective, i. e., involve the optimization of more than one criterion at 
once (partially discussed in Section 1.2.2), or concern objectives which may change over time 
(Section 1.4.9). 

In the previous section, we discussed the No Free Lunch Theorem and argued that it is 
not possible to develop the one optimization algorithm, the problem-solving machine which 
can provide us with near-optimal solutions in short time for every possible optimization 
task. This must sound very depressing for everybody new to this subject. 

Actually, quite the opposite is the case, at least from the point of view of a researcher. 
The No Free Lunch Theorem means that there will always be new ideas, new approaches 
which will lead to better optimization algorithms to solve a given problem. Instead of being 
doomed to obsolescence, it is far more likely that most of the currently known optimization 
methods have at least one niche, one area where they are excellent. It also means that it 
is very likely that the "puzzle of optimization algorithms" will never be completed. There
will always be a chance that an inspiring moment, an observation in nature, for instance, 
may lead to the invention of a new optimization algorithm which performs better in some 
problem areas than all currently known ones. 



1.5 Formae and Search Space/Operator Design 

Most global optimization algorithms share the premise that solutions to problems are either 
elements of a somewhat continuous space that can be approximated stepwise or that they can 
be composed of smaller modules which have good attributes even when occurring separately. 

The design of the search space (or genome) G and the genotype-phenotype mapping 
gpm is vital for the success of the optimization process. It determines to what degree these 
expected features can be exploited by defining how the properties and the behavior of 
the solution candidates are encoded and how the search operations influence them. In this 
chapter, we will first discuss a general theory about how properties of individuals can be 
defined, classified, and how they are related. We will then outline some general rules for 
the design of the genome which are inspired by our previous discussion of the possible 
problematic aspects of fitness landscapes. 

1.5.1 Forma Analysis 

The Schema Theorem has been stated for genetic algorithms by Holland [940] in his seminal
work [940, 512, 945]. In this section, we are going to discuss it in the more general version
from Weicker [2167] as introduced by Radcliffe and Surry [1695] and Surry [1983] in [1692,
1696, 1691, 1695].

The different individuals p in the population Pop of the search and optimization algorithms
are characterized by their properties $\phi$. Whereas the optimizers themselves focus
mainly on the phenotypical properties since these are evaluated by the objective functions, 
the properties of the genotypes may be of interest in an analysis of the optimization perfor- 
mance. 

A rather structural property $\phi_1$ of formulas $f : \mathbb{R} \mapsto \mathbb{R}$ in symbolic regression 73 would be
whether it contains the mathematical expression $x + 1$ or not. We can also declare a behavioral
property $\phi_2$ which is true if $|f(0) - 1| < 0.1$ holds, i.e., if the result of $f$ is close to a value
1 for the input 0, and false otherwise. Assume that the formulas were decoded from a
binary search space $G = \mathbb{B}^n$ to the space of trees that represent mathematical expressions by
a genotype-phenotype mapping. A genotypical property then would be whether a certain sequence
of bits occurs in the genotype p.g, and a phenotypical property is the number of nodes in
the phenotype p.x, for instance. If we try to solve a graph-coloring problem, for example, a
property $\phi_3 \in \{\text{black}, \text{white}, \text{gray}\}$ could denote the color of a specific vertex $q$ as illustrated
in Figure 1.29.

73 More information on symbolic regression can be found in Section 23.1 on page 397.




Figure 1.29: A graph-coloring-based example for properties and formae.



In general, we can imagine the properties $\phi_i$ to be some sort of functions that map the
individuals to property values. $\phi_1$ and $\phi_2$ would then both map the space of mathematical
functions to the set $\mathbb{B} = \{\text{true}, \text{false}\}$ whereas $\phi_3$ maps the space of all possible colorings
for the given graph to the set {white, gray, black}. On the basis of the properties $\phi_i$ we can
define equivalence relations 74 $\sim_{\phi_i}$:

$$p_1 \sim_{\phi_i} p_2 \Rightarrow \phi_i(p_1) = \phi_i(p_2) \quad \forall p_1, p_2 \in G \times X \tag{1.48}$$

Obviously, for each two solution candidates $x_1$ and $x_2$, either $x_1 \sim_{\phi_i} x_2$ or $x_1 \not\sim_{\phi_i} x_2$
holds. These relations divide the search space into equivalence classes $A_{\phi_i=v}$.

Definition 1.57 (Forma). An equivalence class $A_{\phi_i=v}$ that contains all the individuals
sharing the same characteristic $v$ in terms of the property $\phi_i$ is called a forma [1691] or
predicate [2122].

$$A_{\phi_i=v} = \{\forall p \in G \times X : \phi_i(p) = v\} \tag{1.49}$$

$$\forall p_1, p_2 \in A_{\phi_i=v} \Rightarrow p_1 \sim_{\phi_i} p_2 \tag{1.50}$$

The number of formae induced by a property, i.e., the number of its different characteristics,
is called its precision [1691]. The precision of $\phi_1$ and $\phi_2$ is 2, for $\phi_3$ it is 3. We can
define another property $\phi_4 = f(0)$ denoting the value a mathematical function has for the
input 0. This property would have an uncountably infinite precision.
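
The notions of property and forma can be made concrete with a small Python sketch. The toy
population of four formulas and the helper names are assumptions introduced for illustration;
the text-based containment check is a crude stand-in for a structural test on expression
trees.

```python
# Sketch: properties as functions and formae as the classes they induce.
from collections import defaultdict

# Each individual: (expression text, callable); the set is an assumed toy population.
population = [
    ("x+1",       lambda x: x + 1),
    ("(x+1)*x",   lambda x: (x + 1) * x),
    ("x*x+0.95",  lambda x: x * x + 0.95),
    ("2*x",       lambda x: 2 * x),
]

def phi1(individual):
    """Structural property: does the expression contain the term x+1?"""
    return "x+1" in individual[0]

def phi2(individual):
    """Behavioral property: is f(0) close to 1?"""
    return abs(individual[1](0) - 1) < 0.1

def formae(prop, individuals):
    """Group the individuals by their value of the property."""
    classes = defaultdict(list)
    for ind in individuals:
        classes[prop(ind)].append(ind[0])
    return dict(classes)

print(formae(phi1, population))
print(formae(phi2, population))
```

Each key of the returned dictionary corresponds to one forma observed in the population. For
a finite sample like this one, the number of keys is only a lower bound on the precision of
the property.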

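Two formae $A_{\phi_i=v}$ and $A_{\phi_j=w}$ are said to be compatible, written as
$A_{\phi_i=v} \bowtie A_{\phi_j=w}$, if there can exist at least one individual which is an instance of both.

$$A_{\phi_i=v} \bowtie A_{\phi_j=w} \Leftrightarrow A_{\phi_i=v} \cap A_{\phi_j=w} \neq \emptyset \tag{1.51}$$

$$A_{\phi_i=v} \bowtie A_{\phi_j=w} \Leftrightarrow \exists p \in G \times X : p \in A_{\phi_i=v} \land p \in A_{\phi_j=w} \tag{1.52}$$

$$A_{\phi_i=v} \bowtie A_{\phi_i=w} \Rightarrow w = v \tag{1.53}$$

Figure 1.30: Example for formae in symbolic regression.

74 See the definition of equivalence classes in Section 27.7.3 on page 464.
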

Of course, two different formae of the same property $\phi_i$, i.e., two different
characteristics of $\phi_i$, are always incompatible. In our initial symbolic regression example, hence,
$A_{\phi_1=\text{true}} \not\bowtie A_{\phi_1=\text{false}}$ since it is not possible that a function $f$ contains a term $x + 1$ and at
the same time does not contain it. All formae of the properties $\phi_1$ and $\phi_2$, on the other hand,
are compatible: $A_{\phi_1=\text{false}} \bowtie A_{\phi_2=\text{false}}$, $A_{\phi_1=\text{false}} \bowtie A_{\phi_2=\text{true}}$,
$A_{\phi_1=\text{true}} \bowtie A_{\phi_2=\text{false}}$, and $A_{\phi_1=\text{true}} \bowtie A_{\phi_2=\text{true}}$ all hold. If we take $\phi_4$ into
consideration, we will find that there exist some formae compatible with some of $\phi_2$ and some
that are not, like $A_{\phi_2=\text{true}} \bowtie A_{\phi_4=1}$ and $A_{\phi_2=\text{false}} \bowtie A_{\phi_4=2}$, but
$A_{\phi_2=\text{true}} \not\bowtie A_{\phi_4=0}$ and $A_{\phi_2=\text{false}} \not\bowtie A_{\phi_4=0.95}$.
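
When the space is small enough to enumerate, compatibility can be tested directly as a
non-empty intersection. The Python sketch below uses the graph-coloring properties for this
purpose. The graph size of three vertices is an assumption, and coloring constraints between
neighboring vertices are deliberately ignored.

```python
# Sketch: testing compatibility of formae by enumerating a small space.
from itertools import product

COLORS = ("black", "white", "gray")
VERTICES = 3                                    # assumed toy graph size
SPACE = list(product(COLORS, repeat=VERTICES))  # all colorings, constraints ignored

def forma(vertex, color):
    """All colorings in which the given vertex has the given color."""
    return {p for p in SPACE if p[vertex] == color}

def compatible(a, b):
    """Two formae are compatible if at least one individual is in both."""
    return len(a & b) > 0

print(compatible(forma(0, "gray"), forma(1, "black")))  # different properties
print(compatible(forma(0, "gray"), forma(0, "black")))  # same property, different value
```

Formae that belong to the colors of different vertices share individuals, whereas two different
colors of the same vertex can never occur together.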

The discussion of forma and their dependencies stems from the evolutionary algorithm 
community and there especially from the supporters of the Building Block Hypothesis. The 
idea is that the algorithm first discovers formae which have a good influence on the overall 
fitness of the solution candidates. The hope is that there are many compatible ones under 
these formae that are then gradually combined in the search process. 

In this text we have defined formae and the corresponding terms on the basis of individuals
p which are records that assign an element of the problem space $p.x \in X$ to an element
of the search space $p.g \in G$. Generally, we will relax this notation and also discuss formae
directly in the context of the search space G or problem space X, when appropriate.

1.5.2 Genome Design 

In software engineering, there are some design patterns 75 that describe good practice and 
experience values. Utilizing these patterns will help the software engineer to create well- 
organized, extensible, and maintainable applications. 

Whenever we want to solve a problem with global optimization algorithms, we need to
define the structure of a genome. The individual representation along with the
genotype-phenotype mapping is a vital part of genetic algorithms and has major impact on the chance
of finding good solutions.

75 http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29 [accessed 2007-08-12]

We have already discussed the basic problems that we may encounter during optimiza- 
tion. The choice of the search space, the search operations, and the genotype-phenotype 
mapping have major impact on the chance of finding good solutions. After formalizing the 
ideas of properties and formae, we will now outline some general best practices for the genome 
design from different perspectives. These principles can lead to finding better solutions or 
higher optimization speed if considered in the design phase [1765, 1525]. 

Goldberg [821] defines two general design patterns for genotypes in genetic algorithms
which we will state here in the context of the forma analysis [1525]:

1. The representations of the formae in the search space should be as short as possible and 
the representations of different, compatible phenotypic formae should not influence each 
other. 

2. The alphabet of the encoding and the lengths of the different genes should be as small 
as possible. 

Both rules aim at minimal redundancy in the genomes. We have already mentioned
in Section 1.4.5 on page 67 that uniform redundancy slows down the optimization process.
Especially the second rule focuses on this cause of neutrality by discouraging the use of
unnecessarily large alphabets for encoding in a genetic algorithm. Palmer and Kershenbaum
[1602, 1603] define additional rules for tree-representations in [1602, 1601], which have been
generalized by Nguyen [1525]:

3. A good search space and genotype-phenotype mapping should be able to represent all 
phenotypes, i. e., be surjective (see Section 27.7 on page 461). 

$$\forall x \in X \Rightarrow \exists g \in G : x = \mathrm{gpm}(g) \tag{1.54}$$

4. The search space G should be unbiased in the sense that all phenotypes are represented
by the same number of genotypes. This property makes it possible to efficiently select an
unbiased start population, giving the optimizer the chance of reaching all parts of the problem
space.

$$\forall x_1, x_2 \in X \Rightarrow |\{g \in G : x_1 = \mathrm{gpm}(g)\}| \approx |\{g \in G : x_2 = \mathrm{gpm}(g)\}| \tag{1.55}$$

5. The genotype-phenotype mapping should always yield valid phenotypes. The meaning 
of valid in this context is that if the problem space X is the set of all possible trees, 
only trees should be encoded in the genome. If we use $\mathbb{R}^3$ as the problem space, no
vectors with fewer or more elements than three should be produced by the genotype- 
phenotype mapping. This form of validity does not imply that the individuals are also 
correct solutions in terms of the objective functions. 

6. The genotype-phenotype mapping should be simple and bijective. 

7. The representations in the search space should possess strong causality (locality), i. e., 
small changes in the genotype lead to small changes in the phenotype (see Section 1.4.3). 
Optimally, this would mean that: 

$$\forall x_1, x_2 \in X, g \in G : x_1 = \mathrm{gpm}(g) \land x_2 = \mathrm{gpm}(\mathrm{searchOp}(g)) \Rightarrow x_2 \approx x_1 \tag{1.56}$$

Ronald [1752] summarizes some further rules [1752, 1525]: 

8. The genotypic representation should be aligned to a set of reproduction operators in a 
way that good configurations of formae are preserved by the search operations and do 
not easily get lost during the exploration of the search space. 

9. The representations should minimize epistasis (see Section 1.4.6 on page 68 and the 1st
rule).

10. The problem should be represented at an appropriate level of abstraction. 
11. If a direct mapping between genotypes and phenotypes is not possible, a suitable artificial
embryogeny approach should be applied. 
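
Rules 3 and 4 can be checked mechanically for small search spaces. The following Python
sketch compares two hypothetical genotype-phenotype mappings from bit strings of length three
to the integers 0 to 3. Both mappings, the spaces, and the function names are assumptions made
for illustration.

```python
# Sketch: checking a genotype-phenotype mapping for surjectivity and bias.
from collections import Counter
from itertools import product

G = list(product((0, 1), repeat=3))   # search space: bit strings of length 3
X = [0, 1, 2, 3]                      # problem space (assumed)

def gpm_count(g):
    """Map a bit string to its number of ones (redundant, biased)."""
    return sum(g)

def gpm_binary(g):
    """Map a bit string to the integer it encodes, folded into X."""
    return (4 * g[0] + 2 * g[1] + g[2]) % 4

def is_surjective(gpm):
    """Rule 3: every phenotype has at least one genotype."""
    return {gpm(g) for g in G} == set(X)

def genotypes_per_phenotype(gpm):
    """Rule 4: an unbiased mapping gives (roughly) equal counts."""
    return Counter(gpm(g) for g in G)

for gpm in (gpm_count, gpm_binary):
    print(gpm.__name__, is_surjective(gpm), sorted(genotypes_per_phenotype(gpm).items()))
```

Both mappings are surjective, but only the second one represents every phenotype by the same
number of genotypes. Neither is bijective, since eight genotypes are mapped to four
phenotypes.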

Let us now summarize some more conclusions for search spaces based on forma analysis
as stated by Radcliffe [1692] and Weicker [2167].

12. Formae in Genotypic and Phenotypic Space 

The optimization algorithms find new elements in the search space G by applying the search
operations $\mathrm{searchOp} \in Op$. These operations can only create, modify, or combine genotypical
formae since they usually have no information about the problem space. Most mathematical 
models dealing with the propagation of formae like the Building Block Hypothesis and the 
Schema Theorem 76 thus focus on the search space and show that highly fit genotypical for- 
mae will more probably be investigated further than those of low utility. Our goal, however, 
is to find highly fit formae in the problem space X. Such properties can only be created, 
modified, and combined by the search operations if they correspond to genotypical formae. 
A good genotype-phenotype mapping should provide this feature. 

It furthermore becomes clear that useful separate properties in phenotypic space can only 
be combined by the search operations properly if they are represented by separate formae 
in genotypic space too. 

13. Compatibility of Formae 

Formae of different properties should be compatible. Compatible Formae in phenotypic space 
should also be compatible in genotypic space. This leads to a low level of epistasis and hence 
will increase the chance of success of the reproduction operations. 

14. Inheritance of Formae 

The 8th rule mentioned that formae should not get lost during the exploration of the search space.
From a good binary search operation like recombination (crossover) in genetic algorithms,
we can expect that if its two parameters g1 and g2 are members of a forma A, the resulting
element will also be an instance of A.

$\forall g_1, g_2 \in A \subseteq \mathbb{G} \Rightarrow \mathrm{searchOp}(g_1, g_2) \in A$ (1.57)

If we furthermore can assume that all instances of all formae A with minimal precision
(A ∈ mini) of an individual are inherited from at least one parent, the binary reproduction
operation is considered as pure.

$\forall g_3 = \mathrm{searchOp}(g_1, g_2) \in \mathbb{G},\ \forall A \in \mathrm{mini}:\ g_3 \in A \Rightarrow g_1 \in A \lor g_2 \in A$ (1.58)

If this is the case, all properties of a genotype g3 which is a combination of two others
g1, g2 can be traced back to at least one of its parents. Otherwise, searchOp also performs an
implicit unary search step, a mutation in genetic algorithms, for instance. Such properties,
although discussed here for binary search operations only, can be extended to arbitrary n-ary
operators.
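
As a small illustration of Equations 1.57 and 1.58, the following Python sketch treats formae as schemata over bit strings, where '*' marks a don't-care position. This choice of formae, the uniform crossover operator, the reading of "minimal precision" as single-locus formae, and all function names are assumptions made for this example only; they are not taken from the cited sources.

```python
import random

def in_forma(genotype, schema):
    """True if the bit string is an instance of the schema ('*' = don't care)."""
    return all(s == '*' or s == g for g, s in zip(genotype, schema))

def uniform_crossover(g1, g2, rng):
    """Binary search operation: take every gene from one of the two parents."""
    return ''.join(rng.choice(pair) for pair in zip(g1, g2))

def respects(schema, g1, g2, trials=200, seed=1):
    """Empirical check of Eq. 1.57 for one forma and one pair of parents."""
    rng = random.Random(seed)
    assert in_forma(g1, schema) and in_forma(g2, schema)
    return all(in_forma(uniform_crossover(g1, g2, rng), schema)
               for _ in range(trials))

def is_pure_step(g1, g2, g3):
    """Eq. 1.58 for single-locus formae: every gene of g3 stems from a parent."""
    return all(c == a or c == b for a, b, c in zip(g1, g2, g3))
```

Here, respects("1*0*", "1100", "1001") holds because both parents agree on all positions fixed by the schema. An offspring such as "0101" of the same parents would violate purity, since its first gene occurs in neither parent and must therefore stem from an implicit mutation.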



76 See Section 3.6 for more information on the Schema Theorem.

15. Combinations of Formae

If genotypes g1, g2, ... which are instances of different but compatible formae A1 ⋈ A2 ⋈ ...
are combined by a binary (or n-ary) search operation, the resulting genotype g should be an
instance of both properties, i.e., the combination of compatible formae should be a forma
itself.

$\forall g_1 \in A_1, g_2 \in A_2, \ldots \Rightarrow \mathrm{searchOp}(g_1, g_2, \ldots) \in A_1 \cap A_2 \cap \ldots\ (\neq \emptyset)$ (1.59)

If this principle holds for many individuals and formae, useful properties can be com- 
bined by the optimization step by step, narrowing down the precision of the arising, most 
interesting formae more and more. This should lead the search to the most promising regions 
of the search space. 

16. Reachability of Formae 

The set of available search operations Op should include at least one unary search operation 
which is able to reach all possible formae. If the binary search operations in Op all are pure, 
this unary operator is the only one (apart from creation operations) able to introduce new 
formae which are not yet present in the population. Hence, it should be able to find any 
given forma. 

17. Influence of Formae 

One rule which, in my opinion, was missing in the lists given by Radcliffe [1692] and Weicker
[2167] is that the absolute contributions of the single formae to the overall objective values of
a solution candidate should not be too different. Let us divide the phenotypic formae into those
with positive and those with negative or neutral contribution and let us, for simplification
purposes, assume that those with positive contribution can be arbitrarily combined. If one
of the positive formae has a contribution with an absolute value much lower than those of
the other positive formae, we will run into the problem of domino convergence discussed
in Section 1.4.2 on page 58.

Then, the search will first discover the building blocks of higher value. This, itself, is
not a problem. However, as we have already pointed out in Section 1.4.2, if the search is
stochastic and performs exploration steps, chances are that alleles of higher importance get
destroyed during this process and have to be rediscovered. The values of the less salient
formae would then play no role. Thus, the chance of finding them strongly depends on how
frequently the destruction of important formae takes place.

Ideally, we would therefore design the genome and phenome in a way that the different 
characteristics of the solution candidate all influence the objective values to a similar degree. 
Then, the chance of finding good formae increases. 

(18.) Extradimensional Bypass 

Minimal-sized genomes are not always the best approach. An interesting aspect of genome
design supporting this claim is inspired by the works of the theoretical biologist Conrad
[436, 438, 440, 437]. According to his extradimensional bypass principle, it is possible to
transform a rugged fitness landscape with isolated peaks into one with connected saddle
points by increasing the dimensionality of the search space [387, 342]. In [440] he states that
the chance of isolated peaks in randomly created fitness landscapes decreases when their
dimensionality grows.

This partly contradicts rules 1 and 2 which state that genomes should be as compact as
possible. Conrad [440] does not suggest that nature includes useless sequences in the genome
but rather genes which allow for






1. new phenotypical characteristics or

2. redundancy providing new degrees of freedom for the evolution of a species. 

In some cases, such an increase in freedom more than makes up for the additional "costs"
arising from the enlargement of the search space. The extradimensional bypass can be
considered as an example of positive neutrality (see Section 1.4.5).




In Fig. 1.31.a, an example for the extradimensional bypass (similar to Fig. 6 in [246])
is sketched. The original problem had a one-dimensional search space G corresponding to
the horizontal axis up front. As can be seen in the plane in the foreground, the objective
function had two peaks: a local optimum on the left and a global optimum on the right,
separated by a larger valley. When the optimization process began climbing up the local
optimum, it was very unlikely that it ever could escape this hill and reach the global one.

Increasing the search space to two dimensions (G'), however, opened up a pathway
between them. The two isolated peaks became saddle points on a longer ridge. The global
optimum is now reachable from all points on the local optimum.

Generally, increasing the dimension of the search space only makes sense if the added
dimension has a non-trivial influence on the objective functions. Simply adding a useless new
dimension (as done in Fig. 1.31.b) would be an example of some sort of uniform redundancy
from which we already know (see Section 1.4.5) that it is not beneficial. Then again, adding
useful new dimensions may be hard or impossible to achieve in most practical applications.
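
The effect can be reproduced on a toy landscape. The Python sketch below is only an illustration under assumptions of our own: the objective values in F1, the ridge added for y = 1, and the steepest-ascent hill climber are all hypothetical and are not taken from Conrad [440] or Bongard and Paul [246]. The objective is subject to maximization here.

```python
# Hypothetical 1-D objective (maximize): local peak at x = 3, global peak at x = 9.
F1 = [0, 10, 20, 30, 20, 10, 0, 10, 30, 50, 40]

def f_1d(x):
    return F1[x]

def f_2d(p):
    """Extended objective: y = 0 is the original landscape, y = 1 a rising ridge."""
    x, y = p
    return F1[x] if y == 0 else 26 + 2 * x

def neighbors_1d(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(F1)]

def neighbors_2d(p):
    x, y = p
    candidates = [(x - 1, y), (x + 1, y), (x, 1 - y)]
    return [(a, b) for a, b in candidates if 0 <= a < len(F1)]

def hill_climb(objective, start, neighbors):
    """Steepest ascent; stops at the first point without a strictly better neighbor."""
    current = start
    while True:
        best = max(neighbors(current), key=objective)
        if objective(best) <= objective(current):
            return current
        current = best

stuck = hill_climb(f_1d, 2, neighbors_1d)          # ends on the local peak
escaped = hill_climb(f_2d, (2, 0), neighbors_2d)   # uses the ridge at y = 1
```

In one dimension, the climber started at x = 2 stops at the local optimum x = 3. With the additional coordinate y, the same start point leads over the ridge to (9, 0), the global optimum of the original problem. The ridge values stay below the global optimum, so the added dimension changes the connectivity of the landscape but not the best solution.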

A good example for this issue is given by Bongard and Paul [246] who used an EA to
evolve a neural network for the motion control of a bipedal robot. They performed runs
where the evolution had control over some morphological aspects and runs where it had
not. The ability to change the leg width of the robots, for instance, comes at the expense
of an increase of the dimensions of the search space. Hence, one would expect that the
optimization would perform worse. Instead, in one series of experiments, the results were
much better with the extended search space. The runs did not converge to one particular
leg shape but to a wide range of different structures. This led to the assumption that the
morphology itself was not so much the target of the optimization but that the ability of changing it
transformed the fitness landscape to a structure more navigable by the evolution.

In some other experimental runs of Bongard and Paul [246] , this phenomenon could not 
be observed, most likely because 

1. the robot configuration led to a problem of too high complexity, i.e., ruggedness in the
fitness landscape and/or 

2. the increase in dimensionality this time was too large to be compensated by the gain of 
evolvability. 

Further examples for possible benefits of "gradually complexifying" the search space are 
given by Malkin in his doctoral thesis [1351]. 

1.6 General Information 

For each of the optimization methods discussed in this book, you will find such a General
Information section. Here we outline some of the applications of the respective approach,
name the most important conferences, journals, and books, and link to some online
resources.

1.6.1 Areas Of Application 

Some example areas of application of global optimization algorithms are: 



Application                                         References

Chemistry, Chemical Engineering                     [204, 1787, 691]
Biochemistry                                        [690]
Constraint Satisfaction Problems (CSP)              [1519]
Multi-Criteria Decision Making (MCDM)               [877, 375]
Biology                                             [691]
Engineering, Structural Optimization, and Design    [..., 379]
Economics and Finance                               [613, 691, 1051]
Parameter Estimation                                [690]
Mathematical Problems                               [761]
Optics                                              [132, 2057]
Operations Research                                 [691, 878]
Networking and Communication                        [450], see Section 23.2 on page 401



This is just a small sample of the possible applications of global optimization algorithms. It 
has neither some sort of order nor a focus on some specific areas. In the general information 
sections of the following chapters, you will find many application examples for the algorithm 
discussed. 

1.6.2 Conferences, Workshops, etc. 

Some conferences, workshops, and similar events on global optimization algorithms are:



AAAI: National Conference on Artificial Intelligence 
http://www.aaai.org/Conferences/conferences.php [accessed 2007-09-06]
History: 2008: Chicago, Illinois, see [738] 

2007: Vancouver, British Columbia, Canada, see [954] 

2006: Boston, Massachusetts, USA, see [805]

2005: Pittsburgh, Pennsylvania, USA, see [1359] 

2004: San Jose, California, USA, see [1381] 

2002: Edmonton, Alberta, Canada, see [547] 

2000: Austin, Texas, USA, see [1103] 

1999: Orlando, Florida, USA, see [917] 

1998: Madison, Wisconsin, USA, see [1472]

1997: Providence, Rhode Island, USA, see [1219, 3] 

1996: Portland, Oregon, USA, see [410, 2] 

1994: Seattle, WA, USA, see [906] 

1993: Washington, DC, USA, see [668] 

1992: San Jose, California, USA, see [1986] 

1991: Anaheim, California, USA, see [530] 

1990: Boston, Massachusetts, USA, see [563] 

1988: St. Paul, Minnesota, USA, see [1435] 

1987: Seattle, WA, USA, see [723] 

1986: Philadelphia, PA, USA, see [1110, 1111] 

1984: Austin, TX, USA, see [267] 

1983: Washington, DC, USA, see [788] 

1982: Pittsburgh, PA, USA, see [2143]

1980: Stanford University, California, USA, see [126] 
AISB: Artificial Intelligence and Simulation of Behaviour + Workshop on Evolutionary 
Computing 

http://www.aisb.org.uk/convention/index.shtml [accessed 2008-09-11]
History: 2008: Aberdeen, UK, see [866]

2007: Newcastle upon Tyne, UK, see [2030]

2006: Bristol, UK, see [2029]

2005: Hatfield, UK, see [2028]

2004: Leeds, UK, see [2027]

2003: Aberystwyth, UK, see [2026]

2002: Imperial College, UK, see [2025]

2001: York, UK, see [2024]

2000: Birmingham, UK, see [2023]

1997: Manchester, UK, see [447]

1996: Brighton, UK, see [695] 

1995: Sheffield, UK, see [694] 

1994: Leeds, UK, see [693] 
HAIS: International Conference on Hybrid Artificial Intelligence Systems 
http://gicap.ubu.es/hais2009/ [accessed 2009-03-02]
History: 2009: Salamanca, Spain, see [79] 

2008: Burgos, Spain, see [443] 

2007: Salamanca, Spain, see [442] 

2006: Ribeirao Preto, SP, Brazil, see [117] 
HIS: International Conference on Hybrid Intelligent Systems 
http://www.softcomputing.net/hybrid.html [accessed 2007-09-01]

History: 2008: Barcelona, Spain, see [2267] 

2007: Kaiserslautern, Germany, see [1170] 
2006: Auckland, New Zealand, see [993] 
2005: Rio de Janeiro, Brazil, see [1510] 
2004: Kitakyushu, Japan, see [991] 
2003: Melbourne, Australia, see [8] 
2002: Santiago, Chile, see [7] 
2001: Adelaide, Australia, see [6] 
ICNC: International Conference on Advances in Natural Computation 

History: 2007: Haikou, China, see [995, 996, 997, 998, 999] 
2006: Xi'an, China, see [1052, 1053] 
2005: Changsha, China, see [2151, 2152, 2153] 
IAAI: Conference on Innovative Applications of Artificial Intelligence 

http://www.aaai.org/Conferences/IAAI/iaai.php [accessed 2007-09-06]

History: 2006: Boston, Massachusetts, USA, see [805] 

2005: Pittsburgh, Pennsylvania, USA, see [1359] 
2004: San Jose, California, USA, see [1381] 
2003: Acapulco, Mexico, see [1731] 
2002: Edmonton, Alberta, Canada, see [547] 
2001: Seattle, Washington, USA, see [932] 
2000: Austin, Texas, USA, see [1103] 
1999: Orlando, Florida, USA, see [917] 
1998: Madison, Wisconsin, USA, see [1472] 
1997: Providence, Rhode Island, USA, see [1219] 
1996: Portland, Oregon, USA, see [410] 
1995: Montreal, Quebec, Canada, see [22] 
1994: Seattle, Washington, USA, see [318] 
1993: Washington, DC, USA, see [1] 
1992: San Jose, California, USA, see [1844] 
1991: Anaheim, California, USA, see [1907]
1990: Washington, DC, USA, see [1706] 
1989: Stanford University, California, USA, see [1835] 
KES: Knowledge-Based Intelligent Information & Engineering Systems 

History: 2007: Vietri sul Mare, Italy, see [75, 76, 77] 
2006: Bournemouth, UK, see [756, 757, 758] 
2005: Melbourne, Australia, see [1129, 1130, 1131, 1132] 
2004: Wellington, New Zealand, see [1514, 1515, 1516] 
2003: Oxford, UK, see [1599, 1600] 
2002: Podere d'Ombriano, Crema, Italy, see [481] 
2001: Osaka and Nara, Japan, see [1037] 
2000: Brighton, UK, see [962, 963] 
1999: Adelaide, South Australia, see [1032] 
1998: Adelaide, South Australia, see [1033, 1034, 1035]
1997: Adelaide, South Australia, see [1030, 1031] 
MCDM: International Conference on Multiple Criteria Decision Making 

http://project.hkkk.fi/MCDM/conf.html [accessed 2007-09-10]



History: 2008: Auckland, New Zealand, see [620]
2006: Chania, Crete, Greece, see [2333]
2004: Whistler, British Columbia, Canada, see [2165]
2002: Semmering, Austria, see [1334]
2000: Ankara, Turkey, see [1167]
1998: Charlottesville, Virginia, USA, see [877]
1997: Cape Town, South Africa, see [1963]
1995: Hagen, Germany, see [645]
1994: Coimbra, Portugal, see [419]
1992: Taipei, Taiwan, see [2069]
1990: Fairfax, USA, see [1916]
1988: Manchester, UK, see [1301]
1986: Kyoto, Japan, see [1500]
1984: Cleveland, Ohio, USA, see [876]
1982: Mons, Belgium, see [893]
1980: Newark, Delaware, USA, see [1467]
1979: Königswinter, Germany, see [644]
1977: Buffalo, New York, USA, see [2328]
1975: Jouy-en-Josas, France, see [2039]



Mendel: International Conference on Soft Computing 
http://mendel-conference.org/ [accessed 2007-09-09] 



History: 2009: Brno, Czech Republic, see [292]
2008: Brno, Czech Republic, see [291]
2007: Prague, Czech Republic, see [1590]
2006: Brno, Czech Republic, see [293]
2005: Brno, Czech Republic, see [2084]
2004: Brno, Czech Republic, see [2083]
2003: Brno, Czech Republic, see [2082]
2002: Brno, Czech Republic, see [2081]
2001: Brno, Czech Republic, see [2086]
2000: Brno, Czech Republic, see [1591]
1999: Brno, Czech Republic, see [2080]
1998: Brno, Czech Republic, see [2079]
1997: Brno, Czech Republic, see [2078]
1996: Brno, Czech Republic, see [2077]
1995: Brno, Czech Republic, see [2076]



MIC: Metaheuristics International Conference 
History: 2007: Montreal, Canada, see [1449] 
2005: Vienna, Austria, see [2115] 
2003: Kyoto, Japan, see [988]

2001: Porto, Portugal, see [1721] 

1999: Angra dos Reis, Brazil, see [1726] 

1997: Sophia Antipolis, France, see [2124] 

1995: Breckenridge, Colorado, USA, see [1589] 
MICAI: Advances in Artificial Intelligence, The Mexican International Conference on Arti- 
ficial Intelligence 
http://www.micai.org/ [accessed 2008-06-29]

History: 2007: Aguascalientes, Mexico, see [782] 

2006: Apizaco, Mexico, see [781, 493] 

2005: Monterrey, Mexico, see [783] 

2004: Mexico City, Mexico, see [1442] 

2002: Mérida, Yucatán, Mexico, see [425]

2000: Acapulco, Mexico, see [325] 
WOPPLOT: Workshop on Parallel Processing: Logic, Organization and Technology 
History: 1992: Tutzing, Germany (?), see [2068] 

1989: Neubiberg and Wildbad Kreuth, Germany, see [164]

1986: Neubiberg, see [163] 

1983: Neubiberg, see [162] 

In the general information sections of the following chapters, you will find many conferences 
and workshops that deal with the respective algorithms discussed, so this is just a small 
selection. 

1.6.3 Journals 

Some journals that deal (at least partially) with global optimization algorithms are: 

Journal of Global Optimization, ISSN: 0925-5001 (Print), 1573-2916 (Online), appears
monthly, publisher: Springer Netherlands, http://www.springerlink.com/content/100288/
[accessed 2007-09-20]

The Journal of the Operational Research Society, ISSN: 0160-5682, appears monthly,
editor(s): John Wilson, Terry Williams, publisher: Palgrave Macmillan, The OR Society,
http://www.palgrave-journals.com/jors/ [accessed 2007-09-16]

IEEE Transactions on Systems, Man, and Cybernetics (SMC), appears Part A/B:
bi-monthly, Part C: quarterly, editor(s): Donald E. Brown (Part A), Diane Cook (Part B),
Vladimir Marik (Part C), publisher: IEEE Press, http://www.ieeesmc.org/ [accessed 2007-09-16]

Journal of Heuristics, ISSN: 1381-1231 (Print), 1572-9397 (Online), appears bi-monthly,
publisher: Springer Netherlands, http://www.springerlink.com/content/102935/
[accessed 2007-09-16]

European Journal of Operational Research (EJOR), ISSN: 0377-2217, appears bi-weekly,
editor(s): Roman Slowinski, Jesus Artalejo, Jean-Charles Billaut, Robert Dyson,
Lorenzo Peccati, publisher: North-Holland, Elsevier,
http://www.elsevier.com/wps/find/journaldescription.cws_home/505543/description
[accessed 2007-09-21]

Computers & Operations Research, ISSN: 0305-0548, appears monthly, editor(s):
Stefan Nickel, publisher: Pergamon, Elsevier,
http://www.elsevier.com/wps/find/journaldescription.cws_home/300/description
[accessed 2007-09-21]

Applied Statistics, ISSN: 0035-9254, editor(s): Gilmour, Skinner, publisher: Blackwell
Publishing for the Royal Statistical Society,
http://www.blackwellpublishing.com/journal.asp?ref=0035-9254 [accessed 2007-09-16]

Applied Intelligence, ISSN: 0924-669X (Print), 1573-7497 (Online), appears bi-monthly,
publisher: Springer Netherlands, http://www.springerlink.com/content/100236/ [accessed 2007-

Artificial Intelligence Review, ISSN: 0269-2821 (Print), 1573-7462 (Online), appears until
2005, publisher: Springer Netherlands, http://www.springerlink.com/content/100240/
[accessed 2007-09-16]

Journal of Artificial Intelligence Research (JAIR), ISSN: 1076-9757, editor(s): Toby Walsh,
http://www.jair.org/ [accessed 2007-09-16]

Knowledge and Information Systems, ISSN: 0219-1377 (Print), 0219-3116 (Online),
appears approx. eight times a year, publisher: Springer London,
http://www.springerlink.com/content/0219-1377 [accessed 2007-09-16] and
http://www.springer.com/west/home/computer/information+systems?SGWID=4-152-70-1136715-0
[accessed 2007-09-16]

SIAM Journal on Optimization (SIOPT), ISSN: 1052-6234 (print) / 1095-7189 (electronic),
appears quarterly, editor(s): Nicholas I. M. Gould, publisher: Society for Industrial and
Applied Mathematics, http://www.siam.org/journals/siopt.php [accessed 2008-06-14]

Applied Soft Computing, ISSN: 1568-4946, appears quarterly, editor(s): R. Roy, publisher:
Elsevier B.V., http://www.sciencedirect.com/science/journal/15684946 [accessed 2008-06-15]

Advanced Engineering Informatics, ISSN: 1474-0346, appears quarterly, editor(s): J.C. Kunz,
I.F.C. Smith, T. Tomiyama, publisher: Elsevier B.V.,
http://www.elsevier.com/wps/find/journaldescription.cws_home/622240/description
[accessed 2008-08-01]

Journal of Machine Learning Research (JMLR), ISSN: 1533-7928, 1532-4435, appears 8
times/year, editor(s): Lawrence Saul and Leslie Pack Kaelbling, publisher: Microtome
Publishing, http://jmlr.csail.mit.edu/ [accessed 2008-08-06]

Annals of Operations Research, ISSN: 0254-5330, 1572-9338, appears monthly, editor(s):
Endre Boros, publisher: Springer, http://www.springerlink.com/content/0254-5330

International Journal of Applied Metaheuristic Computing (IJAMC), starts in
2010, editor(s): Peng-Yeng Yin, publisher: Information Resources Management Association,
http://www.igi-global.com/journals/details.asp?id=33344 [accessed 2009-01-02]



1.6.4 Online Resources 

Some general resources on global optimization algorithms that are available online are:

http://www.mat.univie.ac.at/~neum/glopt.html [accessed 2007-09-20]
Last update: up-to-date
Description: Arnold Neumaier's global optimization website which includes links,
publications, and software.

http://www.soft-computing.de/ [accessed 2008-05-18]
Last update: up-to-date
Description: Yaochu Jin's site on soft computing including links and conference information.

http://web.ift.uib.no/~antonych/glob.html [accessed 2007-09-20]
Last update: up-to-date
Description: Web site with many links maintained by Gennady A. Ryzhikov.

http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO.htm
[accessed 2007-11-06]
Last update: up-to-date
Description: A beautiful collection of test problems for global optimization algorithms.

http://www.c2i.ntu.edu.sg/AI+CI/Resources/ [accessed 2008-20-25]
Last update: 2006-11-02
Description: A large collection of links about AI and CI.



1.6.5 Books 

Some books about (or including significant information about) global optimization algo- 
rithms are: 

Pardalos, Thoai, and Horst [1614]: Introduction to Global Optimization
Pardalos and Resende [1613]: Handbook of Applied Optimization
Floudas and Pardalos [691]: Frontiers in Global Optimization
Dzemyda, Saltenis, and Zilinskas [613]: Stochastic and Global Optimization
Gandibleux, Sevaux, Sorensen, and T'kindt [766]: Metaheuristics for Multiobjective Optimisation
Glover and Kochenberger [813]: Handbook of Metaheuristics
Torn and Zilinskas [2047]: Global Optimization
Chiong [391]: Nature-Inspired Algorithms for Optimisation
Floudas [690]: Deterministic Global Optimization: Theory, Methods and Applications
Chankong and Haimes [375]: Multiobjective Decision Making Theory and Methodology
Steuer [1961]: Multiple Criteria Optimization: Theory, Computation and Application
Haimes, Hall, and Freedman [878]: Multiobjective Optimization in Water Resource Systems
Charnes and Cooper [376]: Management Models and Industrial Applications of Linear Programming
Corne, Dorigo, Glover, Dasgupta, Moscato, Poli, and Price [448]: New Ideas in Optimisation
Gonzalez [832]: Handbook of Approximation Algorithms and Metaheuristics
Jain and Kacprzyk [1036]: New Learning Paradigms in Soft Computing
Tiwari, Knowles, Avineri, Dahal, and Roy [2044]: Applications of Soft Computing - Recent Trends
Chawdry, Roy, and Pant [379]: Soft Computing in Engineering Design and Manufacturing
Siarry and Michalewicz [1875]: Advances in Metaheuristics for Hard Optimization
Onwubolu and Babu [1580]: New Optimization Techniques in Engineering
Pardalos and Du [1612]: Handbook of Combinatorial Optimization
Reeves [1716]: Modern Heuristic Techniques for Combinatorial Problems
Corne, Oates, and Smith [450]: Telecommunications Optimization: Heuristic and Adaptive Techniques
Kontoghiorghes [1171]: Handbook of Parallel Computing and Statistics
Bui and Alam [299]: Multi-Objective Optimization in Computational Intelligence: Theory and Practice



2 Evolutionary Algorithms



2.1 Introduction 

Definition 2.1 (Evolutionary Algorithm). Evolutionary algorithms 1 (EAs) are
population-based metaheuristic optimization algorithms that use biology-inspired mechanisms
like mutation, crossover, natural selection, and survival of the fittest in order to refine
a set of solution candidates iteratively. [99, 104, 105]

The advantage of evolutionary algorithms compared to other optimization methods is
their "black box" character that makes only few assumptions about the underlying objective
functions. Furthermore, the definition of objective functions usually requires less insight into
the structure of the problem space than the manual construction of an admissible heuristic.
EAs therefore perform consistently well in many different problem categories.

2.1.1 The Basic Principles from Nature 

In 1859, Darwin [485] published his book "On the Origin of Species" 2 in which he identified 
the principles of natural selection and survival of the fittest as driving forces behind the 
biological evolution. His theory can be condensed into ten observations and deductions 
[485, 1375, 2219]: 

1. The individuals of a species possess great fertility and produce more offspring than can
grow into adulthood.

2. Under the absence of external influences (like natural disasters, human beings, etc.), the 
population size of a species roughly remains constant. 

3. Again, if no external influences occur, the food resources are limited but stable over 
time. 

4. Since the individuals compete for these limited resources, a struggle for survival ensues. 

5. Especially in sexually reproducing species, no two individuals are equal.

6. Some of the variations between the individuals will affect their fitness and hence, their 
ability to survive. 

7. A good fraction of these variations are inheritable. 

8. Individuals less fit are less likely to reproduce, whereas the fittest individuals are more
likely to survive and produce offspring.

9. Individuals that survive and reproduce will likely pass on their traits to their offspring. 

1 http://en.wikipedia.org/wiki/Artificial_evolution [accessed 2007-07-03] 

2 http://en.wikipedia.org/wiki/The_Origin_of_Species [accessed 2007-07-03] 

10. A species will slowly change and adapt more and more to a given environment during
this process which may finally even result in new species. 

Evolutionary algorithms abstract from this biological process and also introduce a change
in semantics by being goal-driven [2091]. The search space G in evolutionary algorithms is
then an abstraction of the set of all possible DNA strings in nature and its elements g ∈ G
play the role of the natural genotypes. Therefore, we also often refer to G as the genome and
to the elements g ∈ G as genotypes. Like any creature is an instance of its genotype formed
by embryogenesis 3, the solution candidates (or phenotypes) x ∈ X in the problem space X
are instances of genotypes formed by the genotype-phenotype mapping: x = gpm(g). Their
fitness is rated according to objective functions which are subject to optimization and drive
the evolution into specific directions.
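
To make the roles of G, X, and gpm concrete, here is a minimal Python sketch. It assumes, purely for illustration, that genotypes are fixed-length bit strings and that phenotypes are real numbers in an interval [lo, hi]; the bounds, the function names, and the example objective are hypothetical choices and not part of the definitions above.

```python
def gpm(g, lo=-5.0, hi=5.0):
    """Genotype-phenotype mapping: bit list g in G -> real number x in X = [lo, hi]."""
    as_int = int(''.join(str(bit) for bit in g), 2)
    return lo + (hi - lo) * as_int / (2 ** len(g) - 1)

def objective(x):
    """Example objective function f(x), subject to minimization."""
    return x * x

genotype = [1, 0, 0, 0]        # element of the search space G
phenotype = gpm(genotype)      # element of the problem space X
quality = objective(phenotype) # objective value used to rate the candidate
```

The search operations only ever see the bit list, whereas the objective function only ever sees the decoded number. This separation is what the genotype-phenotype mapping provides.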

2.1.2 The Basic Cycle of Evolutionary Algorithms 

We can distinguish between single-objective and multi-objective evolutionary algorithms, 
where the latter means that we try to optimize multiple, possibly conflicting criteria. Our
following elaborations will be based on these MOEAs. The general area of Evolutionary 
Computation that deals with multi-objective optimization is called EMOO, evolutionary 
multi-objective optimization. 

Definition 2.2 (MOEA). A multi-objective evolutionary algorithm (MOEA) is able to 
perform an optimization of multiple criteria on the basis of artificial evolution [359, 360, 
2101, 534, 537, 716, 1471]. 



[Figure 2.1: The basic cycle of evolutionary algorithms. Initial Population (create an initial
population of random individuals), then the cycle Evaluation (compute the objective values of
the solution candidates), Fitness Assignment (use the objective values to determine fitness
values), Selection (select the fittest individuals for reproduction), Reproduction (create new
individuals from the mating pool by crossover and mutation), and back to Evaluation.]



All evolutionary algorithms proceed in principle according to the scheme illustrated in 
Figure 2.1: 

1. Initially, a population Pop of individuals p with a random genome p.g is created. 

2. The values of the objective functions f ∈ F are computed for each solution candidate
p.x in Pop. This evaluation may incorporate complicated simulations and calculations.

3. With the objective functions, the utility of the different features of the solution candidates
has been determined and a fitness value v(p.x) can now be assigned to each of
them. This fitness assignment process can, for instance, incorporate a prevalence comparator
function cmp_F which uses the objective values to create an order amongst the
individuals.



3 http://en.wikipedia.org/wiki/Embryogenesis [accessed 2008-03-10] 

4. A subsequent selection process filters out the solution candidates with bad fitness and
allows those with good fitness to enter the mating pool with a higher probability. Since 
fitness is subject to minimization in the context of this book, the lower the v(p.x)-values 
are, the higher is the (relative) utility of the individual to whom they belong. 

5. In the reproduction phase, offspring is created by varying or combining the genotypes p.g
of the selected individuals p ∈ Mate by applying the search operations searchOp ∈ Op
(which are called reproduction operations in the context of EAs). These offspring are
then subsequently integrated into the population.

6. If the terminationCriterion() is met, the evolution stops here. Otherwise, the algorithm 
continues at step 2. 
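
The six steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own, not a reference implementation: the 8-bit genome, the two toy objective functions, the dominance-count fitness assignment, binary tournament selection, single-point crossover, bit-flip mutation, full generational replacement, and the fixed generation limit are all hypothetical choices made for this example.

```python
import random

def gpm(g):
    """Genotype-phenotype mapping: 8-bit list -> real number in [-4, 6]."""
    return -4.0 + 10.0 * int(''.join(map(str, g)), 2) / 255.0

# Two conflicting toy objectives, both subject to minimization.
OBJECTIVES = (lambda x: x ** 2, lambda x: (x - 2.0) ** 2)

def cmp_f(fa, fb):
    """Prevalence comparator: -1 if fa prevails, 1 if fb prevails, else 0."""
    a_better = any(a < b for a, b in zip(fa, fb))
    b_better = any(b < a for a, b in zip(fa, fb))
    if a_better and not b_better:
        return -1
    if b_better and not a_better:
        return 1
    return 0

def assign_fitness(values):
    """Fitness = number of individuals prevailing over this one (lower is better)."""
    return [sum(1 for other in values if cmp_f(other, v) < 0) for v in values]

def mutate(g, rng, rate=0.1):
    return [1 - bit if rng.random() < rate else bit for bit in g]

def crossover(g1, g2, rng):
    cut = rng.randrange(1, len(g1))
    return g1[:cut] + g2[cut:]

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # Step 1: initial population of random genotypes
    pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(pop_size)]
    for _ in range(generations):  # Step 6: terminate after a fixed number of generations
        # Step 2: evaluation
        values = [tuple(f(gpm(g)) for f in OBJECTIVES) for g in pop]
        # Step 3: fitness assignment
        fitness = assign_fitness(values)
        # Step 4: selection into the mating pool by binary tournament
        mate = []
        for _ in range(pop_size):
            i, j = rng.randrange(pop_size), rng.randrange(pop_size)
            mate.append(pop[i] if fitness[i] <= fitness[j] else pop[j])
        # Step 5: reproduction; the offspring replace the whole population
        pop = [mutate(crossover(rng.choice(mate), rng.choice(mate), rng), rng)
               for _ in range(pop_size)]
    return pop
```

Since fitness is subject to minimization, a fitness of 0 marks an individual over which no other member of the population prevails. Other fitness assignment, selection, and replacement schemes fit into the same loop without changing its structure.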

In the following few paragraphs, we will discuss how the natural evolution of a species
could proceed and put the artificial evolution of solution candidates in an EA into this
context. When an evolutionary algorithm starts, there exists no information about what is
good or what is bad. Basically, only some random genes p.x = create() are coupled together
as individuals in the initial population Pop(t = 0). I think that back in the Eoarchean 4, the
earth age 3.8 billion years ago in which the first single-celled life most probably occurred, it
was probably the same.

For simplification purposes, we will assume that the evolution does proceed stepwise
in distinct generations. At the beginning of every generation, nature "instantiates" each
genotype p.g (given as DNA sequence) as a new phenotype p.x = gpm(p.g), a living
organism, for example a fish. The survival of the genes of the fish depends on how well
it performs in the ocean (F(p.x) = ?), in other words, on how fit it is (v(p.x)). Its fitness,
however, is not only determined by one single feature of the phenotype like its size (= f1).
Although a bigger fish will have better chances to survive, size alone does not help if it is
too slow to catch any prey (= f2). Also, its energy consumption f3 should be low so it does
not need to eat all the time. Other factors influencing the fitness positively are formae like
sharp teeth f4 and colors that blend into the environment f5 so it cannot be seen too easily
by sharks. If its camouflage is too good, on the other hand, how will it find potential mating
partners (f5 ≁ f6)? And if it is big, it will also have a higher energy consumption (f1 ≁ f3).
So there may be conflicts between the desired properties.

To sum it up, we could consider the life of the fish as the evaluation process of its genotype
in an environment where good qualities in one aspect can turn out as drawbacks in other
perspectives. In multi-objective evolutionary algorithms, this is exactly the same and I tried
to demonstrate this by annotating the fish-story with the symbols previously defined in
the global optimization theory sections. For each problem that we want to solve, we can
specify multiple so-called objective functions f ∈ F. An objective function f represents one
feature that we are interested in. Let us assume that we want to evolve a car (a pretty weird
assumption, but let's stick with it). The genotype p.g ∈ G would be the construction plan
and the phenotype p.x ∈ X the real car, or at least a simulation of it. One objective function
f_a would definitely be safety. For the sake of our children and their children, the car should
also be environment-friendly, so that's our second objective function f_b. Furthermore, a
cheap price f_c, fast speed f_d, and a cool design f_e would be good. That makes five objective
functions from which for example the second and the fourth are contradictory (f_b ≁ f_d).

After the fish genome is instantiated, nature "knows" about its phenotypic properties.
Fitness, however, is always relative; it depends on your environment. I, for example, may
be considered a fit man in my department (computer science). If I took a stroll to the
department of sports science, that statement would probably not hold anymore. The same
goes for the fish: its fitness depends on the other fish in the population (and on its prey and
predators). If one fish p_1.x can beat another one p_2.x in all categories, i.e., is bigger, stronger,
smarter, and so on, we can clearly consider it as fitter (p_1.x ≻ p_2.x ⇒ cmp_F(p_1.x, p_2.x) < 0)
since it will have a better chance to survive. This relation is transitive but only forms a partial
order since a fish that is strong but not very clever and a fish that is clever but not strong



4 http://en.wikipedia.org/wiki/Eoarchean [accessed 2007-07-03]

may have the same probability to reproduce and hence are not directly comparable 5.
We cannot decide whether a weak fish p_3.x with a clever behavioral pattern is worse or
better than a really strong but less cunning one p_4.x (cmp_F(p_3.x, p_4.x) = 0). Both traits are
furthered in the evolutionary process and maybe one fish of the first kind will sometimes
mate with one of the latter and produce an offspring which is both intelligent and sporty 6.
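The comparison of two fish can be sketched as a small comparator. This is only an illustration under assumptions that the text does not fix: all objectives are to be minimized (a trait that should be large, like strength, is stored negated), and the name cmp_pareto as well as the example vectors are hypothetical.

```python
from typing import Sequence


def cmp_pareto(a: Sequence[float], b: Sequence[float]) -> int:
    """Compare two objective vectors, assuming every objective is minimized.

    Returns -1 if a beats b (no worse everywhere, better somewhere),
    1 if b beats a, and 0 if the two are not directly comparable or equal.
    """
    a_better_somewhere = any(x < y for x, y in zip(a, b))
    b_better_somewhere = any(y < x for x, y in zip(a, b))
    if a_better_somewhere and not b_better_somewhere:
        return -1
    if b_better_somewhere and not a_better_somewhere:
        return 1
    return 0


# Hypothetical fish, described by (-strength, -cleverness).
strong_fish = (-9.0, -2.0)
clever_fish = (-2.0, -9.0)
super_fish = (-9.0, -9.0)
```

With these values, cmp_pareto(super_fish, strong_fish) is negative, while the strong and the clever fish compare as 0, which mirrors the partial order described above.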

Multi-objective evolutionary algorithms basically apply the same principles in their fitness
assignment process "assignFitness". One of the most popular methods for computing
the fitness is called Pareto ranking 7. It does exactly what we've just discussed: It first chooses
the individuals that are beaten by no one (we call this the non-dominated set) and assigns a
good (scalar) fitness value v(p_1.x) to them. Then it looks at the rest of the population,
picks those (P ⊂ Pop) which are not beaten by the remaining individuals, and gives them a
slightly worse fitness value v(p.x) > v(p_1.x) ∀p ∈ P - and so on, until all solution candidates
have received one scalar fitness.
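The layer-by-layer procedure just described can be sketched as follows. This is a minimal illustration, not the ranking method of Section 2.3.3: it assumes that all objectives are minimized and that smaller fitness values are better, and the function names dominates and pareto_rank are made up for this example.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_rank(objectives):
    """Assign rank 0 to the non-dominated set, 1 to the next layer, etc."""
    remaining = set(range(len(objectives)))
    fitness = [0] * len(objectives)
    rank = 0
    while remaining:
        # Individuals beaten by no one among those still unranked.
        front = {i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)}
        for i in front:
            fitness[i] = rank
        remaining -= front
        rank += 1
    return fitness


population = [(1, 5), (5, 1), (2, 6), (6, 6), (3, 3)]
ranks = pareto_rank(population)  # [0, 0, 1, 2, 0]
```

Here (1, 5), (5, 1), and (3, 3) are beaten by no one and receive the best value 0; (2, 6) is only beaten by members of that first layer and receives 1; (6, 6) comes last.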

Now, how fit a fish is does not necessarily determine directly if it can produce offspring. 
An intelligent fish may be eaten by a shark and a strong one can die from disease. The 
fitness 8 is only some sort of probability of reproduction. The process of selection is always 
stochastic, without guarantees - even a fish that is small, slow, and lacks any sophisticated 
behavior might survive and could produce even more offspring than a highly fit one. 

The evolutionary algorithms work in exactly the same way - they use a selection algo- 
rithm "select" in order to pick the fittest individuals and place them into the mating pool 
Mate. The oldest selection scheme is called Roulette wheel 9 . In the original version of this 
algorithm (intended for fitness maximization) , the chance of an individual p to reproduce is 
proportional to its fitness v(p.x). 
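A fitness-proportionate draw of this kind can be sketched in a few lines. The sketch assumes non-negative fitness values that are subject to maximization, as in the original version mentioned above; the function name and the optional rng parameter are illustrative choices and not part of the algorithm defined in Section 2.4.3.

```python
import random


def roulette_wheel_select(fitness, count, rng=random):
    """Draw `count` indices with probability proportional to fitness."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    selected = []
    for _ in range(count):
        r = rng.random() * total          # position of the ball on the wheel
        accumulated = 0.0
        for index, f in enumerate(fitness):
            accumulated += f
            if r < accumulated:           # the ball landed in this slice
                selected.append(index)
                break
        else:
            # Only reachable through floating-point rounding.
            selected.append(len(fitness) - 1)
    return selected


picks = roulette_wheel_select([1.0, 0.0, 3.0], 1000, random.Random(1))
```

In this example, the individual with fitness 0 is never chosen, and the one with fitness 3 is expected to be picked about three times as often as the one with fitness 1.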

Last but not least, there is the reproduction phase. Fish reproduce sexually. Whenever a 
female fish and a male fish mate, their genes will be recombined by crossover. Furthermore, 
mutations may take place. Most often, they affect the characteristics of the resulting larva
only slightly [1730]. Since fit fish produce offspring with higher probability, there is a good 
chance that the next generation will contain at least some individuals that have combined 
good traits from their parents and perform even better than them. 

In evolutionary algorithms, we do not have such a thing as "gender". Each individual 
from the mating pool can potentially be recombined with every other one. In the car example, 
this means that we would modify the construction plans by copying the engine of one car 
and placing it into the car body of another one. Also, we could alter some features like the 
shape of the headlights randomly. This way, we receive new construction plans for new cars. 
Our chance that an environment-friendly engine inside a cool-looking car will result in a 
car that is more likely to be bought by the customer is good. If we iteratively perform the 
reproduction process "reproducePop" time and again, there is a high probability that the 
solutions finally found will be close to optimal. 

2.1.3 The Basic Evolutionary Algorithm Scheme 

After this informal outline about the artificial evolution and how we can use it as an opti- 
mization method, let us now specify the basic scheme common to all evolutionary algorithms. 
In principle, all EAs are variations and extensions of the basic approach "simpleEA" defined
in Algorithm 2.1, a cycle of evaluation, selection, and reproduction repeated in each iteration
t. Algorithm 2.1 relies on functions and prototypes that we will introduce step by step. 

5 Which is a very comforting thought for all computer scientists. 

6 I wonder if the girls in the sports department are open to this kind of argumentation? 

7 Pareto comparisons are discussed in Section 1.2.2 on page 31 and elaborations on Pareto ranking 
can be found in Section 2.3.3. 

8 This definition of fitness is not fully compatible with the biological one, see Section 2.1.5 for more
information on that topic.

9 The roulette wheel selection algorithm will be introduced in Section 2.4.3 on page 124.

Algorithm 2.1: X* ← simpleEA(cmp_F, ps)

Input: cmp_F: the comparator function which allows us to compare the utility of two
       solution candidates
Input: ps: the population size
Data: t: the generation counter
Data: Pop: the population
Data: Mate: the mating pool
Data: v: the fitness function resulting from the fitness assigning process
Output: X*: the set of the best elements found

begin
    t ← 0
    Pop ← createPop(ps)
    while ¬terminationCriterion() do
        v ← assignFitness(Pop, cmp_F)
        Mate ← select(Pop, v, ps)
        t ← t + 1
        Pop ← reproducePop(Mate)
    return extractPhenotypes(extractOptimalSet(Pop))
end



1. The function "createPop(ps)" , which will be introduced as Algorithm 2.18 in Section 2.5 
on page 137, produces an initial, randomized population consisting of ps individuals in 
the first iteration t = 0. 

2. The termination criterion "terminationCriterion()" checks whether the evolutionary al- 
gorithm should terminate or continue its work, see Section 1.3.4 on page 54. 

3. Most evolutionary algorithms assign a scalar fitness v(p.x) to each individual p by comparing
its vector of objective values F(p.x) to other individuals in the population Pop.
The function v is built by a fitness assignment process "assignFitness", which we will
discuss in Section 2.3 on page 111 in more detail. During this procedure, the genotype-phenotype
mapping is implicitly carried out as well as simulations needed to compute
the objective functions f ∈ F.

4. A selection algorithm "select" (see Section 2.4 on page 121) then chooses ps interesting 
individuals from the population Pop and inserts them into the mating pool Mate. 

5. With "reproducePop" , a new population is generated from the individuals inside the 
mating pool using mutation and/or recombination. More information on reproduction 
can be found in Section 2.5 on page 137 and in Definition 2.13. 

6. The functions "extractOptimalSet" and "extractPhenotypes", which are
introduced in Definition 19.2 on page 308 and Equation 19.1 on page 307, are used to
extract all the non-prevailed individuals p* from the final population and to return their
corresponding phenotypes p*.x only.
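To make the cycle of Algorithm 2.1 more tangible, here is a minimal sketch for one concrete setting. Everything specific in it is an assumption made for illustration only: bit strings as genotypes with an identity genotype-phenotype mapping, a single objective to be minimized that is used directly as fitness, binary tournament selection, one-point crossover, bit-flip mutation, and a fixed number of generations as termination criterion.

```python
import random


def simple_ea(objective, genome_length, ps, max_generations, rng):
    """Sketch of the simpleEA cycle; genome_length must be at least 2."""
    # createPop: ps random bit strings.
    pop = [[rng.randint(0, 1) for _ in range(genome_length)]
           for _ in range(ps)]
    for t in range(max_generations):        # terminationCriterion
        # assignFitness: here simply the objective value (smaller is better).
        v = [objective(g) for g in pop]
        # select: ps binary tournaments fill the mating pool.
        mate = []
        for _ in range(ps):
            i, j = rng.randrange(ps), rng.randrange(ps)
            mate.append(pop[i] if v[i] <= v[j] else pop[j])
        # reproducePop: one-point crossover followed by bit-flip mutation.
        pop = []
        for _ in range(ps):
            a, b = rng.choice(mate), rng.choice(mate)
            cut = rng.randrange(1, genome_length)
            child = a[:cut] + b[cut:]
            child = [1 - bit if rng.random() < 1.0 / genome_length else bit
                     for bit in child]
            pop.append(child)
    # extractOptimalSet / extractPhenotypes: with one objective, these are
    # the individuals of the final population with the smallest value.
    v = [objective(g) for g in pop]
    best = min(v)
    return [g for g, f in zip(pop, v) if f == best]


# Example: minimize the number of zero bits in a 20-bit string.
result = simple_ea(lambda g: g.count(0), 20, 30, 60, random.Random(42))
```

Since this sketch is generational and keeps no archive, the best individual ever seen may be lost again; the elitist variant discussed later addresses exactly this point.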



2.1.4 From the Viewpoint of Formae 

Let us review our introductory fish example in terms of forma analysis. Fish can, for instance, 
be characterized by the properties "clever" and "strong" . Crudely simplified, both properties 
may be true or false for a single individual and hence define two formae each. A third 
property can be the color, for which many different possible variations exist. Some of them 
may be good in terms of camouflage, others may be good in terms of finding mating partners.
Now a fish can be clever and strong at the same time, as well as weak and green. Here, a 
living fish allows nature to evaluate the utility of at least three different formae. 

This fact was first stated by Holland [940] for genetic algorithms and is termed implicit
parallelism (or intrinsic parallelism). Since then, it has been studied by many different
researchers [858, 853, 188, 2123]. If the search space and the genotype-phenotype mapping
are properly designed, the implicit parallelism in conjunction with the crossover/recombination
operations is one of the reasons why evolutionary algorithms are such a successful class
of optimization algorithms.

2.1.5 Does the Natural Paragon Fit?

At this point it should be mentioned that the direct reference to Darwinian evolution in
evolutionary algorithms is somewhat controversial. Paterson [1619], for example, points out
that "neither GAs [genetic algorithms] nor GP [Genetic Programming] are concerned with
the evolution of new species, nor do they use natural selection." On the other hand, nobody
would claim that the idea of selection has not been borrowed from nature, although many
additions and modifications have been introduced in favor of better algorithmic performance.
The second argument concerning the development of different species depends on definition:
According to Wikipedia [2219], a species is a class of organisms which are very similar in
many aspects such as appearance, physiology, and genetics. In principle, there is some
elbowroom for us and we may indeed consider even different solutions to a single problem in
evolutionary algorithms as members of different species - especially if the binary search
operation crossover/recombination applied to their genomes cannot produce another valid
solution candidate.

Another interesting difference was pointed out by Sharpe [1859], who states that natural
evolution "only proceed[s] sufficiently fast to ensure survival" whereas evolutionary algorithms
used for engineering need to be fast in order to be feasible and to compete with other
problem solving techniques.

Furthermore, although the concept of fitness 10 in nature is controversial [1915], it is 
often considered as an a posteriori measurement. It then defines the ratio of the numbers 
of occurrences of a genotype in a population after and before selection or the number of 
offspring an individual has in relation to the number of offspring of another individual. In 
evolutionary algorithms, fitness is an a priori quantity denoting a value that determines 
the expected number of instances of a genotype that should survive the selection process. 
However, one could conclude that biological fitness is just an approximation of the a priori 
quantity arisen due to the hardness (if not impossibility) of directly measuring it. 

My personal opinion (which may as well be wrong) is that the citation of Darwin here is 
well motivated since there are close parallels between Darwinian evolution and evolutionary 
algorithms. Nevertheless, natural and artificial evolution are still two different things and 
phenomena observed in either of the two do not necessarily carry over to the other. 

2.1.6 Classification of Evolutionary Algorithms 
The Family of Evolutionary Algorithms 

The family of evolutionary algorithms encompasses five members, as illustrated in Figure 2.2. 
We will only enumerate them here in short. In-depth discussions will follow in the next
chapters. 

1. Genetic algorithms (GAs) are introduced in Chapter 3 on page 141. GAs subsume 
all evolutionary algorithms which have bit strings as search space G. 

2. The set of evolutionary algorithms which explore the space of real vectors X ⊆ R^n is
called Evolution Strategies (ES, see Chapter 5 on page 227). 

3. For Genetic Programming (GP), which will be elaborated on in Chapter 4 on 
page 157, we can provide two definitions: On one hand, GP includes all evolutionary 
algorithms that grow programs, algorithms, and the like. On the other hand, all
EAs that evolve tree-shaped individuals are also instances of Genetic Programming.



10 http://en.wikipedia.org/wiki/Fitness_(biology) [accessed 2008-08-10]

4. Learning Classifier Systems (LCS), discussed in Chapter 7 on page 233, are online
learning approaches that assign output values to given input values. They internally use 
a genetic algorithm to find new rules for this mapping. 

5. Evolutionary programming (EP, see Chapter 6 on page 231) is an evolutionary 
approach that treats the instances of the genome as different species rather than as 
individuals. Over the decades, it has more or less merged into Genetic Programming 
and the other evolutionary algorithms. 




Figure 2.2: The family of evolutionary algorithms.



The early research [518] in genetic algorithms (see Section 3.1 on page 141), Genetic
Programming (see Section 4.1.1 on page 157), and evolutionary programming (see Section 6.1
on page 231) dates back to the 1950s and 60s. Besides the pioneering work listed in these
sections, at least one other important early contribution should not go unmentioned here: the
Evolutionary Operation (EVOP) approach introduced by Box [260] and Box and Draper [261]
in the late 1950s. The idea of EVOP was to apply a continuous and systematic scheme of
small changes in the control variables of a process. The effects of these modifications are
evaluated and the process is slowly shifted into the direction of improvement. This idea
was never realized as a computer algorithm, but Spendley et al. [1941] used it as the basis for
their simplex method, which then served as progenitor of the downhill simplex algorithm 11
of Nelder and Mead [1517] [518, 1276]. Satterthwaite's REVOP [1815, 1816], a randomized
Evolutionary Operation approach, however, was rejected at that time [518].

We now have classified different evolutionary algorithms according to their semantics, 
in other words, corresponding to their special search and problem spaces. All five major 
approaches can be realized with the basic scheme defined in Algorithm 2.1. To this simple 
structure, there exist many general improvements and extensions. Since these normally do 
not concern the search or problem spaces, they also can be applied to all members of the 
EA family alike. In the further text of this chapter, we will discuss the major components 
of many of today's most efficient evolutionary algorithms [357]. The distinctive features of 
these EAs are: 

1. The population size or the number of populations used. 

11 We discuss the downhill simplex optimization method of Nelder and Mead [1517] in Chapter 16 on
page 283.

2. The method of selecting the individuals for reproduction.

3. The way the offspring is included into the population(s). 

Populations in Evolutionary Algorithms 

There exist various ways in which an evolutionary algorithm can process its population.
Especially interesting is how the population Pop(t + 1) of the next iteration is formed as a
combination of the current one Pop(t) and its offspring. If it only contains this offspring,
we speak of extinctive selection [1512, 1869]. Extinctive selection can be compared with
ecosystems of small protozoa 12 which reproduce in a fissiparous 13 manner. In this case, of
course, the elders will not be present in the next generation. Other comparisons can partly
be drawn to the sexual reproduction of octopi, where the female dies after protecting the
eggs until the larvae hatch, or to the black widow spider, where the female devours the male
after the insemination. Especially in the area of genetic algorithms, extinctive strategies are
also known as generational algorithms.

Definition 2.3 (Generational). In evolutionary algorithms that are generational [1677], 
the next generation will only contain the offspring of the current one and no parent individ- 
uals will be preserved. 

Extinctive evolutionary algorithms can further be divided into left and right selection 
[2264]. In left extinctive selections, the best individuals are not allowed to reproduce in 
order to prevent premature convergence of the optimization process. Conversely, the worst 
individuals are not permitted to breed in right extinctive selection schemes in order to reduce 
the selective pressure since they would otherwise scatter the fitness too much. 

In algorithms that apply a preservative selection scheme, the next population is a combination
of the current population and its offspring [102, 1064, 1762, 2091]. The biological metaphor for
such algorithms is that the lifespan of many organisms exceeds a single generation. Hence,
parent and child individuals compete with each other for survival.

For Evolution Strategy, which you can find discussed in Chapter 5 on page 227, there
exists a notation which can also be used to describe the generation transition in evolutionary
algorithms in general [934, 935, 1841, 102].

1. λ denotes the number of offspring created and

2. μ is the number of parent individuals.

Extinctive selection patterns are denoted as (μ, λ)-strategies and will create λ > μ child
individuals from the μ available genotypes. From these, they only keep the μ best solution
candidates and discard the μ parents as well as the λ - μ worst children.

In a (μ + λ)-strategy, again λ children are generated from μ parents, often with λ > μ.
Then, the parent and offspring populations are united (to a population of the size λ + μ)
and from this union, only the μ best individuals will "survive". (μ + λ)-strategies are thus
preservative.
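The difference between the two notations is confined to the step that forms the next parent population. The following sketch shows this step only; it assumes that individuals can be ordered by a key function whose smaller values are better, and the function names are hypothetical.

```python
def survivors_comma(parents, offspring, mu, key):
    """(mu, lambda): discard all parents, keep the mu best offspring."""
    if len(offspring) < mu:
        raise ValueError("a (mu, lambda)-strategy needs at least mu offspring")
    return sorted(offspring, key=key)[:mu]


def survivors_plus(parents, offspring, mu, key):
    """(mu + lambda): keep the mu best of parents and offspring together."""
    return sorted(list(parents) + list(offspring), key=key)[:mu]


# mu = 2 parents, lambda = 4 offspring, minimizing the absolute value.
parents = [1.0, 2.0]
offspring = [3.0, 1.5, 4.0, 2.5]
next_comma = survivors_comma(parents, offspring, 2, abs)  # [1.5, 2.5]
next_plus = survivors_plus(parents, offspring, 2, abs)    # [1.0, 1.5]
```

In the extinctive case, the good parent 1.0 is lost although no child is better; in the preservative case, it survives.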

Steady-state evolutionary algorithms [1746, 499, 1538, 365, 1987, 2211], abbreviated by
SSEA, are preservative evolutionary algorithms with values of λ that are relatively low in
comparison with μ. Usually, λ is chosen in a way that a binary search operator (crossover) is
applied exactly once per generation. Steady-state evolutionary algorithms are often
observed to produce better results than generational EAs. Chafekar et al. [365], for example,
introduce steady-state evolutionary algorithms that are able to outperform generational
NSGA-II (which you can find summarized in ?? on page ??) for some difficult problems.
In experiments of Jones and Soule [1066] (primarily focused on other issues), steady-state
algorithms showed better convergence behavior in a multi-modal landscape. Similar results


12 http://en.wikipedia.org/wiki/Protozoa [accessed 2008-03-12]

13 http://en.wikipedia.org/wiki/Binary_fission [accessed 2008-03-12]

have been reported by Chevreux [389] in the context of molecule design optimization. Different
generational selection methods have been compared to the steady-state GENITOR
approach by Goldberg and Deb [822]. On the other hand, with steady-state approaches, we
also run the risk of premature convergence.
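One generation of a steady-state scheme can be sketched as follows. The sketch rests on assumptions that the text leaves open: genotypes are lists of equal length (at least two elements), the objective is minimized, parents are picked uniformly at random, and the single child replaces the currently worst individual only if it is not worse. Other replacement policies are equally possible.

```python
import random


def steady_state_step(pop, objective, rng):
    """Apply crossover once and let the child compete with the worst member."""
    a, b = rng.sample(pop, 2)                 # two distinct parents
    cut = rng.randrange(1, len(a))
    child = a[:cut] + b[cut:]                 # one-point crossover
    worst = max(range(len(pop)), key=lambda i: objective(pop[i]))
    if objective(child) <= objective(pop[worst]):
        pop[worst] = child                    # all other members are preserved
    return pop


def count_zeros(genome):
    return genome.count(0)


rng = random.Random(0)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
worst_before = max(map(count_zeros, pop))
total_before = sum(map(count_zeros, pop))
for _ in range(50):
    steady_state_step(pop, count_zeros, rng)
```

Because at most one individual changes per step and it is never replaced by something worse, the worst objective value in the population cannot increase. The same property also hints at the risk of premature convergence, since no mutation is included here and diversity can only shrink.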

Even in preservative strategies, it is not guaranteed that the best individuals will always
survive. In principle, a (μ + λ) strategy can also mean that from μ + λ individuals, μ are
chosen with a certain selection algorithm. Most selection algorithms are randomized, and even
if such methods pick the best solution candidates with the highest probabilities, they may also
select worse individuals. At this point, it may be interesting to mention that the idea that larger
populations will always lead to better optimization results does not necessarily hold,
as shown by van Nimwegen and Crutchfield [2096].

Definition 2.4 (Elitism). An elitist evolutionary algorithm [512, 1261, 359] ensures that 
at least one copy of the best individual(s) of the current generation is propagated on to the 
next generation. 

The main advantage of elitism is that its convergence is guaranteed, meaning that once 
the global optimum has been discovered, the evolutionary algorithm converges to that opti- 
mum. On the other hand, the risk of converging to a local optimum is also higher. Elitism 
is an additional feature of global optimization algorithms - a special type of preservative 
strategy - which is often realized by using a secondary population only containing the 
non-prevailed individuals. This population is updated at the end of each iteration. Such 
an archive-based elitism can be combined with both, generational and preservative strate- 
gies. Algorithm 2.2 specifies the basic scheme of elitist evolutionary algorithms. 



Algorithm 2.2: X* ← elitistEA(cmp_F, ps, as)

Input: cmp_F: the comparator function which allows us to compare the utility of two
       solution candidates
Input: ps: the population size
Input: as: the archive size
Data: t: the generation counter
Data: Pop: the population
Data: Mate: the mating pool
Data: Arc: the archive with the best individuals found so far
Data: v: the fitness function resulting from the fitness assigning process
Output: X*: the set of best solution candidates discovered

begin
    t ← 0
    Arc ← ∅
    Pop ← createPop(ps)
    while ¬terminationCriterion() do
        Arc ← updateOptimalSetN(Arc, Pop)
        Arc ← pruneOptimalSet(Arc, as)
        v ← assignFitness(Pop, Arc, cmp_F)
        Mate ← select(Pop, Arc, v, ps)
        t ← t + 1
        Pop ← reproducePop(Mate)
    return extractPhenotypes(extractOptimalSet(Pop ∪ Arc))
end



Let us now outline the new methods and changes introduced in Algorithm 2.2 in short. 

1. The archive Arc is the set of best individuals found by the algorithm. Initially, it is
the empty set ∅. Subsequently, it is updated with the function "updateOptimalSetN",
which inserts new, unprevailed elements from the population into it and also removes
individuals from the archive which are superseded by those new optima. Algorithms that
realize such updating are defined in Section 19.1 on page 307.

2. If the optimal set becomes too large - it might theoretically contain uncountably many
individuals - "pruneOptimalSet" reduces it to a proper size, employing techniques like
clustering in order to preserve the element diversity. More about pruning can be found
in Section 19.3 on page 309.

3. You should also notice that both the fitness assignment and the selection processes of elitist
evolutionary algorithms may take the archive as an additional parameter. In principle,
such archive-based algorithms can also be used in non-elitist evolutionary algorithms by
simply replacing the parameter Arc with ∅.
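The two archive operations can be illustrated with a short sketch. It is not the set of algorithms from Chapter 19: individuals are represented directly by their objective vectors, all objectives are assumed to be minimized, and the pruning step is a deliberately naive stand-in that keeps evenly spaced elements of the sorted archive instead of clustering. All function names are chosen for this example.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def update_optimal_set(archive, candidates):
    """Insert non-dominated candidates and drop superseded archive members."""
    archive = list(archive)
    for c in candidates:
        if any(dominates(a, c) or a == c for a in archive):
            continue                          # c brings nothing new
        archive = [a for a in archive if not dominates(c, a)]
        archive.append(c)
    return archive


def prune_optimal_set(archive, max_size):
    """Naive pruning: keep max_size evenly spaced elements (no clustering)."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if len(archive) <= max_size:
        return list(archive)
    ordered = sorted(archive)
    if max_size == 1:
        return [ordered[0]]
    step = (len(ordered) - 1) / (max_size - 1)
    return [ordered[int(i * step)] for i in range(max_size)]


arc = update_optimal_set([], [(1, 5), (5, 1), (2, 6), (3, 3)])
arc = update_optimal_set(arc, [(2, 2)])       # (2, 2) supersedes (3, 3)
small = prune_optimal_set(arc, 2)
```

After the second update, the archive holds (1, 5), (5, 1), and (2, 2); pruning to two elements keeps the two extremes in this case.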

2.1.7 Configuration Parameters of Evolutionary Algorithms

Figure 2.3 illustrates the basic configuration parameters of evolutionary algorithms. The 
performance and success of an evolutionary optimization approach applied to a problem 
given by a set of objective functions F and a problem space X is defined by 




Figure 2.3: The configuration parameters of evolutionary algorithms. 



1. its basic parameter settings like the population size ps or the crossover and mutation 
rates, 

2. whether it uses an archive Arc of the best individuals found and, if so, which pruning 
technology is used to prevent it from overflowing, 

3. the fitness assignment process "assignFitness" and the selection algorithm "select" , 

4. the choice of the search space G and the search operations Op, 

5. and the genotype-phenotype mapping connecting the search space and the problem
space.

In Section 20.1, we go into more detail on how to state the configuration of an optimization
algorithm in order to fully describe experiments and to make them reproducible.

2.2 General Information
2.2.1 Areas Of Application 

Some example areas of application of evolutionary algorithms are: 

Application References 

Function Optimization [1562, 1673] 

Multi-Objective Optimization [715, 716, 357, 1054, 1804, 537] 

Combinatorial Optimization [254, 1762, 1270, 1338] 

Engineering, Structural Optimization, and Design [755, 1412, 1554] 

Constraint Satisfaction Problems (CSP) [2091, 1054, 716, 1804] 

Economics and Finance [388, 1975, 503, 640, 409] 

Biology [2075, 704] 

Data Mining and Data Analysis [2178, 445, 797, 444] 

Mathematical Problems [1094] 

Electrical Engineering and Circuit Design [488, 2075] 

Chemistry, Chemical Engineering [1061, 482, 389] 

Scheduling [1360, 374, 1227, 454, 250] 

Robotics [2158] 

Image Processing [322, 1532] 

Networking and Communication [1889, 1890, 453, 1497, 1684, 35] 

see Section 23.2 on page 401 
Medicine [411, 1911] 

Resource Minimization, Environment Surveillance/Protection

Military and Defense [1393]

Evolving Behaviors, e.g., for Agents or Game Players [1705]

For more information see also the application sections of the different members of the evo- 
lutionary algorithm family: genetic algorithms in Section 3.2.1 on page 142, Genetic Pro- 
gramming in Section 4.2.1 on page 160, Evolution Strategy in Section 5.2.1 on page 227, 
evolutionary programming in Section 6.2.1 on page 231, and Learning Classifier Systems 
in Section 7.2.1 on page 233. 



2.2.2 Conferences, Workshops, etc. 



Some conferences, workshops, and similar events on evolutionary algorithms are:



BIOMA: International Conference on Bioinspired Optimization Methods and their Appli- 
cations 

http://bioma.ijs.si/ [accessed 2007-06-30]
History: 2008: Ljubljana, Slovenia, see [670] 

2006: Ljubljana, Slovenia, see [669] 

2004: Ljubljana, Slovenia, see [671] 
CEC: Congress on Evolutionary Computation 
http://ieeexplore.ieee.org/servlet/opac?punumber=7875 [accessed 2007-09-05]
History: 2008: Hong Kong, China, see [1409] 

2007: Singapore, see [1005] 




2006: Vancouver, BC, Canada, see [2291] 

2005: Edinburgh, Scotland, UK, see [449] 

2004: Portland, Oregon, USA, see [1004] 

2003: Canberra, Australia, see [1803] 

2002: Honolulu, HI, USA, see [703] 

2001: Seoul, Korea, see [1003] 

2000: La Jolla, California, USA, see [1002] 

1999: Washington D.C., USA, see [69] 

1998: Anchorage, Alaska, USA, see [1001] 

1997: Indianapolis, IN, USA, see [106] 

1996: Nagoya, Japan, see [1006] 

1995: Perth, Australia, see [1000] 

1994: Orlando, Florida, USA, see [1411] 
Dagstuhl Seminar: Practical Approaches to Multi-Objective Optimization 
History: 2006: Dagstuhl, Germany, see [283] 

2004: Dagstuhl, Germany, see [281] 
EA/AE: Conference on Artificial Evolution (Evolution Artificielle) 
History: 2007: Tours, France, see [1441] 

2005: Lille, France, see [2000] 

2003: Marseilles, France, see [1283] 

2001: Le Creusot, France, see [428] 

1999: Dunkerque, France, see [711] 

1997: Nimes, France, see [894] 

1995: Brest, France, see [41] 

1994: Toulouse, France, see [40] 
EMO: International Conference on Evolutionary Multi-Criterion Optimization 
History: 2007: Matsushima/Sendai, Japan, see [1555] 

2005: Guanajuato, Mexico, see [422] 

2003: Faro, Portugal, see [719] 

2001: Zurich, Switzerland, see [2331] 
EUROGEN: Evolutionary Methods for Design Optimization and Control with Applications

to Industrial Problems 
History: 2007: Jyvaskyla, Finland, see [2072] 

2005: Munich, Germany, see [1827] 

2003: Barcelona, Spain, see [147] 

2001: Athens, Greece, see [803] 

1999: Jyvaskyla, Finland, see [1413] 

1997: Triest, Italy, see [1681] 

1995: Las Palmas de Gran Canaria, Spain, see [1059] 
EvoCOP: European Conference on Evolutionary Computation in Combinatorial Optimiza- 
tion 

http://www.evostar.org/ [accessed 2007-09-05]
Co-located with EvoWorkshops and EuroGP.
History: 2009: Tübingen, Germany, see [455]
2008: Naples, Italy, see [2094]

2007: Valencia, Spain, see [456]

2006: Budapest, Hungary, see [843] 

2005: Lausanne, Switzerland, see [1700] 

2004: Coimbra, Portugal, see [842] 

2003: Essex, UK, see [1701]

2002: Kinsale, Ireland, see [321] 

2001: Lake Como, Milan, Italy, see [235] 
EvoWorkshops: Applications of Evolutionary Computing: EvoCoMnet, EvoFIN, EvoIASP,
EvoINTERACTION, EvoMUSART, EvoPhD, EvoSTOC and EvoTransLog
http://www.evostar.org/ [accessed 2007-08-05]

Co-located with EvoCOP and EuroGP.
History: 2009: Tübingen, Germany, see [802]

2008: Naples, Italy, see [801]

2007: Valencia, Spain, see [800] 

2006: Budapest, Hungary, see [1768] 

2005: Lausanne, Switzerland, see [1767] 

2004: Coimbra, Portugal, see [1702] 

2003: Essex, UK, see [1701] 

2002: Kinsale, Ireland, see [321] 

2001: Lake Como, Milan, Italy, see [235] 

2000: Edinburgh, Scotland, UK, see [320] 

1999: Goteborg, Sweden, see [1665] 

1998: Paris, France, see [976] 
FEA: International Workshop on Frontiers in Evolutionary Algorithms 
Was part of the Joint Conference on Information Science 
History: 2005: Salt Lake City, Utah, USA, see [1794] 

2003: Cary, North Carolina, USA, see [639] 

2002: Research Triangle Park, North Carolina, USA, see [353] 

2000: Atlantic City, NJ, USA, see [2154]

1998: Research Triangle Park, North Carolina, USA, see [2021] 
1997: Research Triangle Park, North Carolina, USA, see [1865] 
FOCI: IEEE Symposium on Foundations of Computational Intelligence 

History: 2007: Honolulu, Hawaii, USA, see [1388] 
GECCO: Genetic and Evolutionary Computation Conference 
http://www.sigevo.org/ [accessed 2007-08-30] 

A recombination of the Annual Genetic Programming Conference (GP, see Section 4.2.2 on 
page 161) and the International Conference on Genetic Algorithms (ICGA, see Section 3.2.2 
on page 143), also "contains" the International Workshop on Learning Classifier Systems 
(IWLCS, see Section 7.2.2 on page 234). 

History: 2008: Atlanta, Georgia, USA, see [1117, 409, 1393, 1911, 1705] 
2007: London, England, see [2037, 2038] 
2006: Seattle, Washington, USA, see [352] 
2005: Washington, D.C., USA, see [202, 199, 1764, 1766] 
2004: Seattle, Washington, USA, see [544, 545, 1113] 
2003: Chicago, Illinois, USA, see [334, 335]
2002: New York, USA, see [1245, 331, 154, 1572, 1326]
2001: San Francisco, California, USA, see [1937, 833]
2000: Las Vegas, Nevada, USA, see [2216, 2210]

1999: Orlando, Florida, USA, see [142, 1584, 1889] 
GEM: International Conference on Genetic and Evolutionary Methods 

see Section 3.2.2 on page 143 
ICANNGA: International Conference on Adaptive and Natural Computing Algorithms 
before 2005: International Conference on Artificial Neural Nets and Genetic Algorithms 
History: 2007: Warsaw, Poland, see [173, 174] 

2005: Coimbra, Portugal, see [1725] 

2003: Roanne, France, see [1628] 

2001: Prague, Czech Republic, see [1224] 

1999: Portorož, Slovenia, see [576] 

1997: Norwich, England, see [1902] 

1995: Alès, France, see [1627] 

1993: Innsbruck, Austria, see [36] 
ICNC: International Conference on Advances in Natural Computation 

see Section 1.6.2 on page 89 
Mendel: International Conference on Soft Computing 

see Section 1.6.2 on page 90 
PPSN: International Conference on Parallel Problem Solving from Nature 
http://ls11-www.informatik.uni-dortmund.de/PPSN/ [accessed 2007-09-05] 
History: 2008: Dortmund, Germany, see [1948] 

2006: Reykjavik, Iceland, see [1779] 

2004: Birmingham, UK, see [2285] 

2002: Granada, Spain, see [867] 

2000: Paris, France, see [1830] 

1998: Amsterdam, The Netherlands, see [624] 

1996: Berlin, Germany, see [2118] 

1994: Jerusalem, Israel, see [492] 

1992: Brussels, Belgium, see [1357] 

1990: Dortmund, Germany, see [1842] 



2.2.3 Journals 

Some journals that deal (at least partially) with evolutionary algorithms are: 

Evolutionary Computation, ISSN: 1063-6560, appears quarterly, editor(s): Marc Schoenauer, 
publisher: MIT Press, http://www.mitpressjournals.org/loi/evco [accessed 2007-09-16] 
IEEE Transactions on Evolutionary Computation, ISSN: 1089-778X, appears bi-monthly, 
editor(s): Xin Yao, publisher: IEEE Computational Intelligence Society, 
http://ieee-cis.org/pubs/tec/ [accessed 2007-09-16] 

Biological Cybernetics, ISSN: 0340-1200 (Print), 1432-0770 (Online), appears bi-monthly, 
publisher: Springer Berlin/Heidelberg, http://www.springerlink.com/content/100465/ 
[accessed 2007-09-16] 

Complex Systems, ISSN: 0891-2513, appears quarterly, editor(s): Stephen Wolfram, publisher: 
Complex Systems Publications, Inc., http://www.complex-systems.com/ [accessed 2007-09-16] 
Journal of Artificial Intelligence Research (JAIR) (see Section 1.6.3 on page 92) 
New Mathematics and Natural Computation (NMNC), ISSN: 1793-0057, appears three times 
a year, editor(s): Paul P. Wang, publisher: World Scientific, 
http://www.worldscinet.com/nmnc/ [accessed 2007-09-19] 
The Journal of the Operational Research Society (see Section 1.6.3 on page 91) 



2.2.4 Online Resources 

Some general, online available resources on evolutionary algorithms are: 

http://www.lania.mx/~ccoello/EMOO/ [accessed 2007-09-20] 
Last update: up-to-date 
Description: EMOO Web page - Dr. Coello Coello's giant bibliography and paper repository 
for evolutionary multi-objective optimization. 

http://www-isf.maschinenbau.uni-dortmund.de/links/ci_links.html [accessed 2007-10-14] 
Last update: up-to-date 
Description: Computational Intelligence (CI)-related links and literature, maintained by 
Jörn Mehnen 



http://www.aip.de/~ast/EvolCompFAQ/ [accessed 2007-09-10] 
Last update: 2001-04-01 

Description: Frequently Asked Questions of the comp.ai.genetic group by Heitkötter and 
Beasley [916] 

http://nknucc.nknu.edu.tw/~hcwu/pdf/evolec.pdf [accessed 2007-09-16] 
Last update: 2005-02-19 
Description: Lecture Notes on Evolutionary Computation by Wu [2264] 

http://ls11-www.cs.uni-dortmund.de/people/beyer/EA-glossary/ [accessed 2008-04-10] 
Last update: 2002-02-25 
Description: Online glossary on terms and definitions in evolutionary algorithms by Beyer 
et al. [201] 

http://www.illigal.uiuc.edu/web/ [accessed 2008-05-17] 
Last update: up-to-date 
Description: The Illinois Genetic Algorithms Laboratory (IlliGAL) 

http://www.peterindia.net/Algorithms.html [accessed 2008-05-17] 
Last update: up-to-date 
Description: A large collection of links about evolutionary algorithms, Genetic Programming, 
genetic algorithms, etc. 

http://www.fmi.uni-stuttgart.de/fk/evolalg/ [accessed 2008-05-17] 
Last update: 2003-07-08 
Description: The Evolutionary Computation repository of the University of Stuttgart. 

http://dis.ijs.si/filipic/ec/ [accessed 2008-05-18] 
Last update: 2007-11-09 
Description: The Evolutionary Computation repository of the Jožef Stefan Institute in 
Slovenia 

http://www.red3d.com/cwr/evolve.html [accessed 2008-05-18] 
Last update: 2002-07-27 
Description: Evolutionary Computation and its application to art and design by Craig 
Reynolds 

http://surf.de.uu.net/encore/ [accessed 2008-05-18] 
Last update: 2004-08-26 
Description: ENCORE, the electronic appendix to The Hitch-Hiker's Guide to Evolutionary 
Computation, see [916] 

http://www-isf.maschinenbau.uni-dortmund.de/links/ci_links.html [accessed 2008-05-18] 
Last update: 2006-09-13 
Description: A collection of links to computational intelligence / EAs 

http://www.tik.ee.ethz.ch/sop/education/misc/moeaApplet/ [accessed 2008-10-25] 
Last update: 2008-06-30 

Description: An applet illustrating a multi-objective EA 



2.2.5 Books 

Some books about (or including significant information about) evolutionary algorithms are: 

Bäck [99]: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, 
Evolutionary Programming, Genetic Algorithms 

Bäck, Fogel, and Michalewicz [104]: Handbook of Evolutionary Computation 
Coello Coello, Lamont, and van Veldhuizen [361]: Evolutionary Algorithms for Solving 
Multi-Objective Problems 

Deb [537]: Multi-Objective Optimization Using Evolutionary Algorithms 

Coello Coello and Lamont [424]: Applications of Multi-Objective Evolutionary Algorithms 

Eiben and Smith [623]: Introduction to Evolutionary Computing 

Dumitrescu, Lazzerini, Jain, and Dumitrescu [608]: Evolutionary Computation 

Fogel [696]: Evolutionary Computation: The Fossil Record 

Bäck, Fogel, and Michalewicz [107]: Evolutionary Computation 1: Basic Algorithms and 
Operators 

Bäck, Fogel, and Michalewicz [108]: Evolutionary Computation 2: Advanced Algorithms and 
Operators 

Bentley [181]: Evolutionary Design by Computers 

De Jong [515]: Evolutionary Computation: A Unified Approach 

Weicker [2167]: Evolutionäre Algorithmen 

Gerdes, Klawonn, and Kruse [789]: Evolutionäre Algorithmen 

Nissen [1535]: Einführung in evolutionäre Algorithmen: Optimierung nach dem Vorbild der 
Evolution 

Yao [2284]: Evolutionary Computation: Theory and Applications 
Yu, Davis, Baydar, and Roy [2299]: Evolutionary Computation in Practice 
Yang, Ong, and Jin [2280]: Evolutionary Computation in Dynamic and Uncertain Environ- 
ments 

Morrison [1464]: Designing Evolutionary Algorithms for Dynamic Environments 
Branke [280]: Evolutionary Optimization in Dynamic Environments 
Nedjah, Alba, and Mourelle [1512]: Parallel Evolutionary Computations 
Kosinski [1177]: Advances in Evolutionary Algorithms 

Rothlauf [1765]: Representations for Genetic and Evolutionary Algorithms 
Banzhaf and Eeckman [137]: Evolution and Biocomputation - Computational Models of Evo- 
lution 
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Fogel and Corne [704]: Evolutionary Computation in Bioinformatics 
Johnston [1061]: Applications of Evolutionary Computation in Chemistry 
Clark [411]: Evolutionary Algorithms in Molecular Design 
Chen [388]: Evolutionary Computation in Economics and Finance 
Ghosh and Jain [797]: Evolutionary Computation in Data Mining 

Miettinen, Mäkelä, Neittaanmäki, and Périaux [1412]: Evolutionary Algorithms in Engineering 
and Computer Science 

Fogel [698]: Evolutionary Computation: Principles and Practice for Signal Processing 
Ashlock [85]: Evolutionary Computation for Modeling and Optimization 
Watanabe and Hashem [2158]: Evolutionary Computations - New Algorithms and their Ap- 
plications to Evolutionary Robots 

Cagnoni, Lutton, and Olague [322]: Genetic and Evolutionary Computation for Image Pro- 
cessing and Analysis 

Kramer [1214]: Self-Adaptive Heuristics for Evolutionary Computation 

Lobo, Lima, and Michalewicz [1299]: Parameter Setting in Evolutionary Algorithms 

Spears [1925]: Evolutionary Algorithms - The Role of Mutation and Recombination 

Eiben and Michalewicz [621]: Evolutionary Computation 

Jin [1055]: Knowledge Incorporation in Evolutionary Computation 

Grosan, Abraham, and Ishibuchi [862]: Hybrid Evolutionary Algorithms 

Abraham, Jain, and Goldberg [9]: Evolutionary Multiobjective Optimization 

Kallel, Naudts, and Rogers [1083]: Theoretical Aspects of Evolutionary Computing 

Ghosh and Tsutsui [798]: Advances in Evolutionary Computing - Theory and Applications 

Yang, Shan, and Bui [2279]: Success in Evolutionary Computation 

Pereira and Tavares [1635]: Bio-inspired Algorithms for the Vehicle Routing Problem 



2.3 Fitness Assignment 
2.3.1 Introduction 

With the concept of Pareto domination and prevalence comparisons introduced in Section 1.2.2 
on page 27, we define a partial order on the elements in the problem space X. In 
multi-objective optimization, each solution candidate p.x is characterized by a vector of 
objective values F(p.x). Many selection algorithms, however, cannot work with such vectors 
and need scalar fitness values instead. By assigning a single real number v(p.x) (the 
fitness) to each solution candidate p.x, a total order is also defined on them. 

The fitness assigned to an individual may not just reflect its rank in the population, but 
can also incorporate density/niching information. This way, not only the quality of a solution 
candidate is considered, but also the overall diversity of the population. This can improve the 
chance of finding the global optima as well as the performance of the optimization algorithm 
significantly. If many individuals in the population occupy the same rank or do not dominate 
each other, for instance, such information will be very helpful. 

The fitness v(p.x) thus may not only depend on the solution candidate p.x itself, but on 
the whole population Pop of the evolutionary algorithm (and on the archive Arc of optimal 
elements, if available). In practical realizations, the fitness values are often stored in a special 
member variable in the individual records. Therefore, v(p.x) can be considered as a mapping 
that returns the value of such a variable which has previously been stored there by a fitness 
assignment process "assignFitness" . 

Definition 2.5 (Fitness Assignment). A fitness assignment process "assignFitness" creates 
a function $v : X \mapsto \mathbb{R}^+$ which relates a scalar fitness value to each solution 
candidate in the population Pop (Equation 2.1) and, if an archive is available, in the 
archive Arc (Equation 2.2). 

$$v = \text{assignFitness}(\mathit{Pop}, \mathit{cmp}_F) \Rightarrow v(p.x) \in V \subseteq \mathbb{R}^+ \;\; \forall p \in \mathit{Pop} \tag{2.1}$$
$$v = \text{assignFitness}(\mathit{Pop}, \mathit{Arc}, \mathit{cmp}_F) \Rightarrow v(p.x) \in V \subseteq \mathbb{R}^+ \;\; \forall p \in \mathit{Pop} \cup \mathit{Arc} \tag{2.2}$$

In the context of this book, we generally minimize fitness values, i.e., the lower the 
fitness of a solution candidate, the better. Therefore, many of the fitness assignment 
processes based on the prevalence relation will obey Equation 2.3. This equation represents 
a general relation - sometimes it is useful to violate it for some individuals in the 
population, especially when crowding information is incorporated. 

$$p_1.x \succ p_2.x \Rightarrow v(p_1.x) < v(p_2.x) \;\; \forall p_1, p_2 \in \mathit{Pop} \cup \mathit{Arc} \tag{2.3}$$



2.3.2 Weighted Sum Fitness Assignment 

The most primitive fitness assignment strategy would be to assign a weighted sum of the 
objective values. This approach is very static and comes with the same problems as the 
weighted sum-based approach for defining what an optimum is, introduced in Section 1.2.2 on 
page 29. It makes no use of the prevalence relation. For computing the weighted sum of the 
different objective values of a solution candidate, we reuse Equation 1.4 on page 29 from 
the weighted sum optimum definition. The weights have to be chosen in a way that ensures 
that $v(p.x) \in \mathbb{R}^+$ holds for all individuals p. 

$$v = \text{assignFitnessWeightedSum}(\mathit{Pop}) \Rightarrow v(p.x) = g(p.x) \;\; \forall p \in \mathit{Pop} \tag{2.4}$$
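
As a minimal sketch of Equation 2.4, the following Python fragment computes a weighted sum 
for a few objective vectors. The function name, the example vectors, and the weights are 
assumptions made for illustration only; the weighted sum g itself is the one referred to 
above. 

```python
def assign_fitness_weighted_sum(objective_vectors, weights):
    """Return one scalar fitness per individual: the weighted sum of its objective values."""
    fitness = []
    for values in objective_vectors:
        if len(values) != len(weights):
            raise ValueError("need exactly one weight per objective")
        g = sum(w * f for w, f in zip(weights, values))
        if g < 0:
            # Fitness values are required to lie in R+ (see the text above).
            raise ValueError("weights lead to a negative fitness value")
        fitness.append(g)
    return fitness


# Hypothetical objective vectors (f1, f2), both subject to minimization.
population = [(1, 7), (2, 4), (10, 1)]
equal_weights = assign_fitness_weighted_sum(population, (1.0, 1.0))
f2_emphasized = assign_fitness_weighted_sum(population, (0.5, 2.0))
```

With equal weights the second vector receives the best (lowest) fitness, whereas the third 
one wins as soon as f2 is emphasized. The ordering thus depends entirely on weights that are 
fixed beforehand, which is the static behavior criticized above. 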



2.3.3 Pareto Ranking 

Another very simple method for computing fitness values is to let them directly reflect the 
Pareto domination (or prevalence) relation. Figure 2.4 and Table 2.1 illustrate the Pareto 
relations in a population of 15 individuals and their corresponding objective values $f_1$ 
and $f_2$, both subject to minimization. There are two ways for doing this: First, to each 
individual, 




Figure 2.4: An example scenario for Pareto ranking. 

we can assign a value inversely proportional to the number of other individuals it prevails, 
like $v(p_1.x) = \frac{1}{\left|\{p_2 \in \mathit{Pop} : p_1.x \succ p_2.x\}\right| + 1}$. We 
have written such fitness values in the column "Ap. 1" of Table 2.1 for Pareto optimization, 
i.e., the special case where the Pareto dominance relation is used to define prevalence. 
Individuals that dominate many others will here receive a lower fitness value than those 
which are prevailed by many. When taking a look at these values, the disadvantage of this 
approach becomes clear: It promotes individuals that reside in crowded regions of the 
problem space and underrates those in sparsely explored areas. 

 x   prevails                    is prevailed by                     Ap. 1   Ap. 2 
 1   {5,6,8,9,14,15}             ∅                                   1/7     0 
 2   {6,7,8,9,10,11,13,14,15}    ∅                                   1/10    0 
 3   {12,13,14,15}               ∅                                   1/5     0 
 4   ∅                           ∅                                   1       0 
 5   {8,15}                      {1}                                 1/3     1 
 6   {8,9,14,15}                 {1,2}                               1/5     2 
 7   {9,10,11,14,15}             {2}                                 1/6     1 
 8   {15}                        {1,2,5,6}                           1/2     4 
 9   {14,15}                     {1,2,6,7}                           1/3     4 
10   {14,15}                     {2,7}                               1/3     2 
11   {14,15}                     {2,7}                               1/3     2 
12   {13,14,15}                  {3}                                 1/4     1 
13   {15}                        {2,3,12}                            1/2     3 
14   {15}                        {1,2,3,6,7,9,10,11,12}              1/2     9 
15   ∅                           {1,2,3,5,6,7,8,9,10,11,12,13,14}    1       13 

Table 2.1: The Pareto domination relation of the individuals illustrated in Figure 2.4. 

By doing so, the fitness assignment process achieves exactly the opposite of what we 
want. Instead of exploring the problem space and delivering a wide scan of the frontier 
of best possible solution candidates, it will focus all effort on a small set of individuals. 
We will only obtain a subset of the best solutions and it is even possible that this fitness 
assignment method leads to premature convergence to a local optimum. A good example 
for this problem is given by the four non-prevailed individuals {1,2,3,4} from the Pareto 
frontier. The best fitness is assigned to element 2, followed by individual 1. Although 
individual 7 is dominated (by 2), its fitness is better than the fitness of the 
non-dominated element 3. 

The solution candidate 4 gets the worst possible fitness 1, since it prevails no other 
element. Its chances for reproduction are about as low as those of individual 15, which 
is dominated by all other elements except 4. Hence, both solution candidates will most 
probably not be selected and will vanish in the next generation. The loss of solution 
candidate 4 will greatly decrease the diversity and further increase the focus on the 
crowded area near 1 and 2. 
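
The effect described above can be reproduced with a short Python sketch. It assumes plain 
Pareto dominance as the prevalence relation and takes the objective values (f1, f2) of the 
15 individuals from Table 2.2; the function names are chosen for illustration and are not 
part of the original text. 

```python
from fractions import Fraction


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def assign_fitness_inverse_prevail_count(objective_vectors):
    """First approach: fitness = 1 / (number of individuals prevailed + 1)."""
    fitness = []
    for i, a in enumerate(objective_vectors):
        prevailed = sum(1 for j, b in enumerate(objective_vectors)
                        if j != i and dominates(a, b))
        fitness.append(Fraction(1, prevailed + 1))
    return fitness


# Objective values (f1, f2) of individuals 1..15, stored at indices 0..14.
OBJECTIVES = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
              (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]
fitness = assign_fitness_inverse_prevail_count(OBJECTIVES)
```

Here, individual 2 (index 1) obtains the lowest value, the dominated individual 7 (index 6) 
is rated better than the non-dominated individual 3 (index 2), and individual 4 (index 3) 
ends up with the worst possible value 1. 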

A much better second approach for fitness assignment is directly based on the domination 
(or prevalence) relation and was first proposed by Goldberg [821]. Here, the idea is to 
assign to each solution candidate the number of individuals it is prevailed by [1315, 253, 
255, 851]. This way, the previously mentioned negative effects will not occur. The column 
"Ap. 2" in Table 2.1 shows that all four non-prevailed individuals now have the best 
possible fitness 0. Hence, the exploration pressure is applied to a much wider area of the 
Pareto frontier. This so-called Pareto ranking can be performed by first removing all 
non-prevailed individuals from the population and assigning the rank 0 to them. Then, the 
same is performed with the rest of the population. The individuals only dominated by those 
on rank 0 (now non-dominated) will be removed and get the rank 1. This is repeated until all 
solution candidates have a proper fitness assigned to them. Algorithm 2.3 outlines another 
simple way to perform Pareto ranking. Since we follow the idea of the freer prevalence 
comparators instead of Pareto dominance relations, we will synonymously refer to this 
approach as Prevalence ranking. 

Algorithm 2.3: v ← assignFitnessParetoRank(Pop, cmp_F) 

Input: Pop: the population to assign fitness values to 
Input: cmp_F: the prevalence comparator defining the prevalence relation 
Data: i, j, cnt: the counter variables 
Output: v: a fitness function reflecting the Prevalence ranking 

begin 
  for i ← len(Pop) - 1 down to 0 do 
    cnt ← 0 
    p ← Pop[i] 
    for j ← len(Pop) - 1 down to 0 do 
      // Check whether cmp_F(Pop[j].x, p.x) < 0 
      if (j ≠ i) ∧ (Pop[j].x ≻ p.x) then cnt ← cnt + 1 
    v(p.x) ← cnt 
  return v 
end 

As already mentioned, the fitness values of all non-prevailed elements in our example 
Figure 2.4 and Table 2.1 are equally 0. However, the region around the individuals 1 and 2 
has probably already been explored extensively, whereas the surrounding of solution 
candidate 4 is rather unknown. A better approach of fitness assignment should incorporate 
such information and put a bit more pressure into the direction of individual 4, in order 
to make the evolutionary algorithm investigate this area more thoroughly. 
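
Algorithm 2.3 translates almost directly into Python. The sketch below assumes that 
individuals are represented by their objective vectors only and that the prevalence 
comparator is plain Pareto dominance, returning a negative number if its first argument 
prevails the second; these representation choices and all names are illustrative 
assumptions. 

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def cmp_pareto(a, b):
    """Prevalence comparator: negative if a prevails b, positive if b prevails a, else 0."""
    if dominates(a, b):
        return -1
    if dominates(b, a):
        return 1
    return 0


def assign_fitness_pareto_rank(pop, cmp=cmp_pareto):
    """Fitness of each individual = number of other individuals it is prevailed by."""
    fitness = []
    for i, p in enumerate(pop):
        cnt = 0
        for j, q in enumerate(pop):
            if j != i and cmp(q, p) < 0:  # q prevails p
                cnt += 1
        fitness.append(cnt)
    return fitness


# Objective values (f1, f2) of individuals 1..15 as listed in Table 2.2.
OBJECTIVES = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
              (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]
ranks = assign_fitness_pareto_rank(OBJECTIVES)
```

For the example population, the resulting list corresponds to the column "Ap. 2" of 
Table 2.1. Like the pseudocode, the sketch needs a quadratic number of comparisons. 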

2.3.4 Sharing Functions 

Previously, we have mentioned that the drawback of Pareto ranking is that it does not 
incorporate any information about whether the solution candidates in the population reside 
closely to each other or in regions of the problem space which are only sparsely covered by 
individuals. Sharing, as a method for including such diversity information into the fitness 
assignment process, was introduced by Holland [940] and later refined by Deb [532], Goldberg 
and Richardson [824], and Deb and Goldberg [539]; see also [1801, 1417, 1558]. 

Definition 2.6 (Sharing Function). A sharing function $Sh : \mathbb{R}^+ \mapsto \mathbb{R}^+$ 
is a function used to relate two individuals $p_1$ and $p_2$ to a value that decreases with 
their distance¹⁴ $d = dist(p_1, p_2)$ in a way that it is 1 for $d = 0$ and 0 if the distance 
exceeds a specified constant $\sigma$. 

$$Sh_\sigma(d = dist(p_1, p_2)) = \begin{cases} 1 & \text{if } d \le 0 \\ Sh_\sigma(d) \in [0, 1] & \text{if } 0 < d < \sigma \\ 0 & \text{otherwise} \end{cases} \tag{2.5}$$

Sharing functions can be employed in many different ways and are used by a variety 
of fitness assignment processes [824, 532]. Typically, the simple triangular function Sh_tri 
[959] or one of its either convex (Sh_cvex_p) or concave (Sh_ccav_p) counterparts with the 
power $p \in \mathbb{R}^+, p > 0$ are applied. Besides using different powers of the 
distance-σ-ratio, another approach is the exponential sharing method Sh_exp. 



$$\text{Sh\_tri}_\sigma(d) = \begin{cases} 1 - \frac{d}{\sigma} & \text{if } 0 \le d < \sigma \\ 0 & \text{otherwise} \end{cases} \tag{2.6}$$
$$\text{Sh\_cvex}_{\sigma,p}(d) = \begin{cases} \left(1 - \frac{d}{\sigma}\right)^p & \text{if } 0 \le d < \sigma \\ 0 & \text{otherwise} \end{cases} \tag{2.7}$$
$$\text{Sh\_ccav}_{\sigma,p}(d) = \begin{cases} 1 - \left(\frac{d}{\sigma}\right)^p & \text{if } 0 \le d < \sigma \\ 0 & \text{otherwise} \end{cases} \tag{2.8}$$
$$\text{Sh\_exp}_{\sigma,p}(d) = \begin{cases} 1 & \text{if } d \le 0 \\ 0 & \text{if } d \ge \sigma \\ \frac{e^{-\frac{p d}{\sigma}} - e^{-p}}{1 - e^{-p}} & \text{otherwise} \end{cases} \tag{2.9}$$

For sharing, the distance of the individuals in the search space G as well as their distance 
in the problem space X or the objective space Y may be used. If the solution candidates 
are real vectors in $\mathbb{R}^n$, we could use the Euclidean distance of the phenotypes of 
the individuals directly, i.e., compute $dist_{eucl}(p_1.x, p_2.x)$. In genetic algorithms, 
where the search space is the set of all bit strings $G = \mathbb{B}^n$ of the length n, 
another suitable approach would be to use the Hamming distance¹⁵ $dist_{Ham}(p_1.g, p_2.g)$ 
of the genotypes. The work of Deb [532], however, indicates that phenotypical sharing will 
often be superior to genotypical sharing. 

Definition 2.7 (Niche Count). The niche count m(p, P) [535, 1417] of an individual p is 
the sum of its sharing values with all individuals in a list P. 

$$\forall p \in P \Rightarrow m(p, P) = \sum_{i=0}^{len(P)-1} Sh_\sigma(dist(p, P[i])) \tag{2.10}$$

The niche count m is always greater than zero, since $p \in P$ and, hence, 
$Sh_\sigma(dist(p, p)) = 1$ is computed and added up at least once. The original sharing 
approach was developed for single-objective optimization where only one objective function 
f was subject to maximization. In this case, its value was simply divided by the niche 
count, punishing solutions in crowded regions [1417]. The goal of sharing was to distribute 
the population over a number of different peaks in the fitness landscape, with each peak 
receiving a fraction of the population proportional to its height [959]. The results of 
dividing the fitness by the niche counts strongly depend on the height differences of the 
peaks and thus, on the complexity class¹⁶ of f. On $f_1 \in O(x)$, for instance, the 
influence of m is much bigger than on a $f_2 \in O(e^x)$. 

By multiplying the niche count m to predetermined fitness values v', we can use this 
approach for fitness minimization in conjunction with a variety of other different fitness 
assignment processes, but also inherit its shortcomings: 

$$v(p.x) = v'(p.x) * m(p, \mathit{Pop}), \quad v' = \text{assignFitness}(\mathit{Pop}, \mathit{cmp}_F) \tag{2.11}$$

Sharing was traditionally combined with fitness proportionate, i.e., roulette wheel 
selection¹⁷. Oei et al. [1558] have shown that if the sharing function is computed using the 
parental individuals of the "old" population and then naively combined with the more 
sophisticated tournament selection¹⁸, the resulting behavior of the evolutionary algorithm 
may be chaotic. They suggested using the partially filled "new" population to circumvent 
this problem. The layout of evolutionary algorithms, as defined in this book, bases the 
fitness computation on the whole set of "new" individuals and assumes that their objective 
values have already been completely determined. In other words, such issues simply do not 
exist in multi-objective evolutionary algorithms as introduced here and the chaotic behavior 
does not occur. 

14 The concept of distance and a set of different distance measures is defined in 
Section 29.1 on page 537. 

15 See Definition 29.6 on page 537 for more information on the Hamming distance. 

16 See Section 30.1.3 on page 550 for a detailed introduction into complexity and the O-notation. 

17 Roulette wheel selection is discussed in Section 2.4.3 on page 124. 

18 You can find an outline of tournament selection in Section 2.4.4 on page 127. 

For computing the niche count m, $O(n^2)$ comparisons are needed. According to Goldberg 
et al. [827], sampling the population can be sufficient to approximate m in order to avoid 
this quadratic complexity. 
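
The sharing functions of Equations 2.6 to 2.9, the niche count of Equation 2.10, and the 
fitness scaling of Equation 2.11 can be sketched in Python as follows. The sketch assumes 
that individuals are given as real vectors and that the Euclidean distance is used; the 
function names and the small example population are illustrative assumptions. 

```python
import math


def sh_tri(d, sigma):
    """Triangular sharing function (Equation 2.6)."""
    if d <= 0.0:
        return 1.0
    return 1.0 - d / sigma if d < sigma else 0.0


def sh_cvex(d, sigma, p):
    """Convex sharing function with power p (Equation 2.7)."""
    if d <= 0.0:
        return 1.0
    return (1.0 - d / sigma) ** p if d < sigma else 0.0


def sh_ccav(d, sigma, p):
    """Concave sharing function with power p (Equation 2.8)."""
    if d <= 0.0:
        return 1.0
    return 1.0 - (d / sigma) ** p if d < sigma else 0.0


def sh_exp(d, sigma, p):
    """Exponential sharing function with power p (Equation 2.9)."""
    if d <= 0.0:
        return 1.0
    if d >= sigma:
        return 0.0
    return (math.exp(-p * d / sigma) - math.exp(-p)) / (1.0 - math.exp(-p))


def niche_count(x, points, sigma, sh=sh_tri):
    """Sum of the sharing values of x with all elements of points (Equation 2.10)."""
    return sum(sh(math.dist(x, y), sigma) for y in points)


def shared_fitness(base_fitness, points, sigma, sh=sh_tri):
    """Multiply predetermined fitness values (minimized) by the niche counts (Equation 2.11)."""
    return [v * niche_count(x, points, sigma, sh) for v, x in zip(base_fitness, points)]


# Hypothetical population: two neighbors and one isolated point, all with base fitness 1.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
result = shared_fitness([1.0, 1.0, 1.0], points, sigma=2.0)
```

In this example, the two neighboring points share a niche and receive the fitness 1.5, while 
the isolated point keeps the fitness 1.0 and is therefore preferred under minimization. 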

2.3.5 Variety Preserving Ranking 

Using sharing and the niche counts naively leads to more or less unpredictable effects. Of 
course, it promotes solutions located in sparsely populated niches, but how much their 
fitness will be improved is rather unclear. Using distance measures which are not normalized 
can lead to strange effects, too. Imagine two objective functions $f_1$ and $f_2$. If the 
values of $f_1$ span from 0 to 1 for the individuals in the population whereas those of 
$f_2$ range from 0 to 10 000, the components of $f_1$ will most often be negligible in the 
Euclidean distance of two individuals in the objective space Y. Another problem is that the 
effect of simple sharing on the pressure into the direction of the Pareto frontier is not 
obvious either and depends on the sharing approach applied. Some methods simply add a niche 
count to the Pareto rank, which may cause non-dominated individuals to have worse fitness 
than others in the population. Other approaches scale the niche count into the interval 
[0, 1) before adding it, which not only ensures that non-dominated individuals have the best 
fitness but also leaves the relation between individuals at different ranks intact, which 
does not further variety very much. 
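
The normalization problem can be made concrete with a few lines of Python. The three 
objective vectors below are hypothetical and only chosen to match the ranges mentioned 
above; scaling each objective by the inverse of its range is one possible remedy and is 
shown here purely as an illustration. 

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance of two objective vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scaled_euclidean(a, b, mins, maxs):
    """Euclidean distance after scaling every objective by the inverse of its range."""
    total = 0.0
    for x, y, lo, hi in zip(a, b, mins, maxs):
        scale = 1.0 / (hi - lo) if hi > lo else 1.0
        total += ((x - y) * scale) ** 2
    return math.sqrt(total)


# f1 ranges over [0, 1], f2 over [0, 10000] (hypothetical values).
a = (0.0, 5000.0)
b = (1.0, 5000.0)   # differs from a by the whole range of f1
c = (0.0, 5100.0)   # differs from a by one percent of the range of f2
mins, maxs = (0.0, 0.0), (1.0, 10000.0)

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab = scaled_euclidean(a, b, mins, maxs)
scaled_ac = scaled_euclidean(a, c, mins, maxs)
```

Without scaling, c appears a hundred times farther away from a than b does, although b 
differs from a by the full range of one objective. After scaling, the relation is reversed. 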

Variety Preserving Ranking is a fitness assignment approach based on Pareto ranking 
using prevalence comparators and sharing. We have developed it in order to mitigate all these 
previously mentioned side effects and balance the evolutionary pressure between optimizing 
the objective functions and maximizing the variety inside the population. In the following, 
we will describe the process of Variety Preserving Ranking-based fitness assignment which 
is defined in Algorithm 2.4. 

Before this fitness assignment process can begin, all individuals with infinite objective 
values must be removed from the population Pop. If such a solution candidate is optimal, 
i.e., if it has negative infinitely large objectives in a minimization process, for 
instance, it should receive fitness zero, since fitness is subject to minimization. If the 
individual is infeasible, on the other hand, its fitness should be set to 
$len(\mathit{Pop}) + \sqrt{len(\mathit{Pop})} + 1$, which is one larger than every other 
fitness value that may be assigned by Algorithm 2.4. 

In lines 2 to 9, we create a list ranks which we use to efficiently compute the Pareto 
rank of every solution candidate in the population. Strictly speaking, the term prevalence 
rank would be more precise in this case, since we use prevalence comparisons as introduced 
in Section 1.2.4. Therefore, Variety Preserving Ranking is not limited to Pareto 
optimization but may also incorporate External Decision Makers (Section 1.2.4) or the 
method of inequalities (Section 1.2.3). The highest rank encountered in the population is 
stored in the variable maxRank. This value may be zero if the population contains only 
non-prevailed elements. The lowest rank will always be zero since the prevalence 
comparators $\mathit{cmp}_F$ define order relations which are non-circular by definition.¹⁹ 
We will use maxRank to determine the maximum penalty for solutions in an overly crowded 
region of the search space later on. 

From line 10 to 18, we determine the maximum and the minimum values that each 
objective function takes on when applied to the individuals in the population. These values 
are used to store the inverse of their ranges in the array rangeScales, which we will use to 
scale all distances in each dimension (objective) of the individuals into the interval 
[0, 1]. There are $|F|$ objective functions in F and, hence, the maximum Euclidean distance 
between two solution candidates in the (scaled) objective space becomes $\sqrt{|F|}$. It 
occurs if all the distances in the single dimensions are 1. 

19 In all order relations imposed on finite sets there is always at least one "smallest" 
element. See Section 27.7.2 on page 463 for more information. 

Algorithm 2.4: v ← assignFitnessVarietyPreserving(Pop, cmp_F) 

Input: Pop: the population 
Input: cmp_F: the comparator function 
Input: [implicit] F: the set of objective functions 
Data: ...: sorry, no space here, we'll discuss this in the text 
Output: v: the fitness function 

 1 begin 
     /* If needed: Remove all elements with infinite objective values from Pop 
        and assign fitness 0 or len(Pop) + √len(Pop) + 1 to them. Then compute the 
        prevalence ranks. */ 
 2   ranks ← createList(len(Pop), 0) 
 3   maxRank ← 0 
 4   for i ← len(Pop) - 1 down to 0 do 
 5     for j ← i - 1 down to 0 do 
 6       k ← cmp_F(Pop[i].x, Pop[j].x) 
 7       if k < 0 then ranks[j] ← ranks[j] + 1 
 8       else if k > 0 then ranks[i] ← ranks[i] + 1 
 9     if ranks[i] > maxRank then maxRank ← ranks[i] 
     // determine the ranges of the objectives 
10   mins ← createList(|F|, +∞) 
11   maxs ← createList(|F|, -∞) 
12   foreach p ∈ Pop do 
13     for i ← |F| down to 1 do 
14       if f_i(p.x) < mins[i-1] then mins[i-1] ← f_i(p.x) 
15       if f_i(p.x) > maxs[i-1] then maxs[i-1] ← f_i(p.x) 
16   rangeScales ← createList(|F|, 1) 
17   for i ← |F| - 1 down to 0 do 
18     if maxs[i] > mins[i] then rangeScales[i] ← 1 / (maxs[i] - mins[i]) 
     // Base a sharing value on the scaled Euclidean distance of all elements 
19   shares ← createList(len(Pop), 0) 
20   minShare ← +∞ 
21   maxShare ← -∞ 
22   for i ← len(Pop) - 1 down to 0 do 
23     curShare ← shares[i] 
24     for j ← i - 1 down to 0 do 
25       dist ← 0 
26       for k ← |F| down to 1 do 
27         dist ← dist + [(f_k(Pop[i].x) - f_k(Pop[j].x)) * rangeScales[k-1]]² 
28       s ← Sh_exp_{√|F|,16}(√dist) 
29       curShare ← curShare + s 
30       shares[j] ← shares[j] + s 
31     shares[i] ← curShare 
32     if curShare < minShare then minShare ← curShare 
33     if curShare > maxShare then maxShare ← curShare 
     // Finally, compute the fitness values 
34   scale ← 1 / (maxShare - minShare) if maxShare > minShare, 1 otherwise 
35   for i ← len(Pop) - 1 down to 0 do 
36     if ranks[i] > 0 then 
37       v(Pop[i].x) ← ranks[i] + √maxRank * scale * (shares[i] - minShare) 
38     else v(Pop[i].x) ← scale * (shares[i] - minShare) 
39 end 

The most complicated part of the Variety Preserving Ranking algorithm is between 
line 19 and 33. Here we compute the scaled distance from every individual to each other 
solution candidate in the objective space and use this distance to aggregate share values 
(in the array shares). Therefore, again two nested loops are needed (lines 22 and 24). The 
distance components of two individuals Pop[i] and Pop[j] are scaled and summed up in a 
variable dist in line 27. The Euclidean distance between them is $\sqrt{dist}$, which we use 
to determine a sharing value in line 28. We have decided on exponential sharing with 
power 16 and $\sigma = \sqrt{|F|}$, as introduced in Equation 2.9 on page 115. For every 
individual, we sum up all the shares (see line 30). While doing so, we also determine the 
minimum and maximum such total share in the variables minShare and maxShare in lines 32 
and 33. 

We will use these variables to scale all sharing values again into the interval [0, 1] (line 
34), so the individual in the most crowded region always has a total share of 1 and the most 
remote individual always has a share of 0. So basically, we now know two things about the 
individuals in Pop: 

1. their Pareto ranks, stored in the array ranks, giving information about their relative 
quality according to the objective values and 

2. their sharing values, held in shares, denoting how densely crowded the area around 
them is. 

With this information, we determine the final fitness values of an individual p as follows: 
If p is non-prevailed, i.e., its rank is zero, its fitness is its scaled total share 
(line 38). Otherwise, we multiply the square root of the maximum rank, $\sqrt{maxRank}$, 
with the scaled share and add it to its rank (line 37). By doing so, we preserve the 
supremacy of non-prevailed individuals in the population but allow them to compete with each 
other based on the crowdedness of their location in the objective space. All other solution 
candidates may degenerate in rank, but at most by the square root of the worst rank. 
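
The steps of Algorithm 2.4 can be summarized in a compact Python sketch. It assumes that 
individuals are represented by finite objective vectors only (candidates with infinite 
objective values are expected to have been removed beforehand) and that prevalence is plain 
Pareto dominance; the function names are illustrative. The example data are the objective 
values (f1, f2) of the 15 individuals from Table 2.2. 

```python
import math


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sh_exp(d, sigma, p):
    """Exponential sharing function (Equation 2.9)."""
    if d <= 0.0:
        return 1.0
    if d >= sigma:
        return 0.0
    return (math.exp(-p * d / sigma) - math.exp(-p)) / (1.0 - math.exp(-p))


def variety_preserving_fitness(objs, power=16.0):
    n, m = len(objs), len(objs[0])
    # Prevalence ranks: number of individuals by which each individual is prevailed.
    ranks = [sum(1 for j in range(n) if j != i and dominates(objs[j], objs[i]))
             for i in range(n)]
    max_rank = max(ranks)
    # Inverse ranges of the objectives, used to scale each dimension into [0, 1].
    scales = []
    for k in range(m):
        lo, hi = min(o[k] for o in objs), max(o[k] for o in objs)
        scales.append(1.0 / (hi - lo) if hi > lo else 1.0)
    # Total share of each individual with all others (self-distance is not counted).
    sigma = math.sqrt(m)
    shares = [0.0] * n
    for i in range(n):
        for j in range(i):
            d = math.sqrt(sum(((objs[i][k] - objs[j][k]) * scales[k]) ** 2
                              for k in range(m)))
            s = sh_exp(d, sigma, power)
            shares[i] += s
            shares[j] += s
    lo, hi = min(shares), max(shares)
    scale = 1.0 / (hi - lo) if hi > lo else 1.0
    # Non-prevailed individuals get their scaled share; all others are penalized
    # by at most sqrt(max_rank) on top of their rank.
    fitness = []
    for r, s in zip(ranks, shares):
        scaled = scale * (s - lo)
        fitness.append(r + math.sqrt(max_rank) * scaled if r > 0 else scaled)
    return fitness


OBJECTIVES = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
              (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]
v = variety_preserving_fitness(OBJECTIVES)
```

The list v can be compared with the column v(x) of Table 2.2. Every non-prevailed individual 
ends up with a fitness in [0, 1], so it is never rated worse than a prevailed one. 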

Example 

Let us now apply Variety Preserving Ranking to the examples for Pareto ranking from 
Section 2.3.3. In Table 2.2, we again list all the solution candidates from Figure 2.4 on 
page 112, this time with their objective values obtained with $f_1$ and $f_2$ corresponding 
to their coordinates in the diagram. In the third column, you can find the Pareto ranks of 
the individuals as listed in Table 2.1 on page 113. The columns share/u and share/s 
correspond to the total sharing sums of the individuals, unscaled and scaled into [0, 1]. 



  x    f1    f2   rank   share/u   share/s     v(x)
  1     1     7      0     0.71      0.779    0.779
  2     2     4      0     0.239     0.246    0.246
  3     6     2      0     0.201     0.202    0.202
  4    10     1      0     0.022     0        0
  5     1     8      1     0.622     0.679    3.446
  6     2     7      2     0.906     1        5.606
  7     3     5      1     0.531     0.576    3.077
  8     2     9      4     0.314     0.33     5.191
  9     3     7      4     0.719     0.789    6.845
 10     4     6      2     0.592     0.645    4.325
 11     5     5      2     0.363     0.386    3.39
 12     7     3      1     0.346     0.366    2.321
 13     8     4      3     0.217     0.221    3.797
 14     7     7      9     0.094     0.081    9.292
 15     9     9     13     0.025     0.004   13.01


Table 2.2: An example for Variety Preserving Ranking based on Figure 2.4. 

Figure 2.5: The sharing potential in the Variety Preserving Ranking example 



But first things first: as already mentioned, we know the Pareto ranks of the solution 
candidates from Table 2.1, so the next step is to determine the ranges of values the objective 
functions take on for the example population. These can again easily be read from 
Figure 2.4. f1 spans from 1 to 10, which leads to rangeScale[0] = 1/9, and rangeScale[1] = 1/8 
since the maximum of f2 is 9 and its minimum is 1. With this, we can now compute the 
(dimensionally scaled) distances amongst the solution candidates in the objective space, the 
values of √dist in Algorithm 2.4, as well as the corresponding values of the sharing 
function Sh_exp_{√2,16}(√dist). We have noted these in Table 2.3, using the upper triangle 
of the table for the distances and the lower triangle for the shares. 
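
The two quantities stored in Table 2.3 can be recomputed with a few lines of Python. This is only a sketch: the names `scaled_distance` and `sh_exp` are hypothetical, and the form of the exponential sharing function as well as its parameters σ = √2 and p = 16 are assumptions, chosen because they reproduce the table entries; Algorithm 2.4 itself is not shown here.

```python
import math

# Objective values (f1, f2) of some solution candidates, copied from Table 2.2.
points = {1: (1, 7), 2: (2, 4), 4: (10, 1), 5: (1, 8), 8: (2, 9)}

# rangeScale[i] = 1 / (max f_i - min f_i): f1 spans 1..10, f2 spans 1..9.
range_scale = (1 / 9, 1 / 8)


def scaled_distance(a, b):
    """Euclidean distance after scaling each objective by its range."""
    return math.sqrt(sum(((x - y) * s) ** 2 for x, y, s in zip(a, b, range_scale)))


def sh_exp(d, sigma=math.sqrt(2), p=16):
    """Exponential sharing function (assumed form and parameters)."""
    if d <= 0:
        return 1.0
    if d >= sigma:
        return 0.0
    return (math.exp(-p * d / sigma) - math.exp(-p)) / (1 - math.exp(-p))


d12 = scaled_distance(points[1], points[2])   # upper triangle, row 1, column 2
s12 = sh_exp(d12)                             # lower triangle, row 2, column 1
```

Under these assumptions, `d12` and `s12` agree with the entries 0.391 and 0.012 of Table 2.3 up to rounding.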

The value of the sharing function can be imagined as a scalar field, as illustrated in 
Figure 2.5. In this case, each individual in the population can be considered as an electron 
that builds an electrical field around it, resulting in a potential. If two electrons come 
close, repulsive forces occur, which is much the same as what we want to achieve with Variety 
Preserving Ranking. Unlike the electrical field, the power of the sharing potential falls 
exponentially, resulting in relatively steep spikes in Figure 2.5, which gives proximity and 
density a heavier influence. Electrons in atoms on planets are limited in their movement by other 
influences like gravity or nuclear forces, which are often stronger than the electromagnetic 
force. In Variety Preserving Ranking, the prevalence rank plays this role: as you can see in 
Table 2.2, its influence on the fitness is often dominant. 

By summing up the single sharing potentials for each individual in the example, we 
obtain the fifth column of Table 2.2, the unscaled share values. Their minimum is around 
0.022 and the maximum is 0.906. Therefore, we must subtract 0.022 from each of these values 
and multiply the result with 1/(0.906 - 0.022) ≈ 1.131. By doing so, we build the column 
share/s. Finally, we can compute the fitness values v(x) according to lines 38 and 37 in 
Algorithm 2.4. 

Upper triangle: distances. Lower triangle: corresponding share values. 

           1       2       3       4       5       6       7       8       9      10      11      12      13      14      15
   1       -   0.391   0.836    1.25   0.125   0.111   0.334   0.274   0.222   0.356    0.51   0.833   0.863   0.667   0.923
   2   0.012       -    0.51   0.965   0.512   0.375   0.167   0.625   0.391   0.334   0.356   0.569   0.667    0.67   0.998
   3  7.7E-5   0.003       -   0.462   0.933   0.767   0.502   0.981   0.708   0.547   0.391   0.167   0.334   0.635   0.936
   4  6.1E-7  1.8E-5   0.005       -   1.329   1.163   0.925   1.338    1.08   0.914   0.747   0.417   0.436   0.821   1.006
   5   0.243   0.003  2.6E-5  1.8E-7       -   0.167   0.436   0.167   0.255   0.417   0.582   0.914   0.925   0.678   0.898
   6   0.284   0.014  1.7E-4  1.8E-6   0.151       -   0.274    0.25   0.111   0.255   0.417   0.747   0.765   0.556   0.817
   7   0.023   0.151   0.003  2.9E-5   0.007   0.045       -   0.512    0.25   0.167   0.222    0.51   0.569    0.51   0.833
   8   0.045   0.001  1.5E-5  1.5E-7   0.151   0.059   0.003       -   0.274   0.436   0.601   0.933   0.914   0.609   0.778
   9   0.081   0.012  3.3E-4  4.8E-6   0.056   0.284   0.059   0.045       -   0.167   0.334   0.669    0.67   0.444   0.712
  10   0.018   0.023   0.002  3.2E-5   0.009   0.056   0.151   0.007   0.151       -   0.167   0.502    0.51   0.356    0.67
  11   0.003   0.018   0.012  2.1E-4   0.001   0.009   0.081   0.001   0.023   0.151       -   0.334   0.356   0.334   0.669
  12    8E-5   0.002   0.151   0.009  3.2E-5  2.1E-4   0.003  2.6E-5   0.001   0.003   0.023       -   0.167     0.5   0.782
  13  5.7E-5   0.001   0.023   0.007  2.9E-5  1.7E-4   0.002  3.2E-5   0.001   0.003   0.018   0.151       -   0.391   0.635
  14   0.001   0.001   0.001  9.3E-5  4.6E-4   0.002   0.003   0.001   0.007   0.018   0.023   0.003   0.012       -   0.334
  15  2.9E-5  1.2E-5  2.5E-5  1.1E-5  3.9E-5  9.7E-5    8E-5  1.5E-4  3.1E-4   0.001   0.001  1.4E-4   0.001   0.023       -

Table 2.3: The distance and sharing matrix of the example from Table 2.2. 
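
The last step of the example, turning Pareto ranks and unscaled shares into fitness values, can be sketched in Python as follows. The function name `variety_preserving_fitness` is hypothetical, the input lists are copied from Table 2.2 (with rank 0 for the four non-prevailed individuals), and only the rule described for lines 37 and 38 of Algorithm 2.4 is implemented, not the whole algorithm.

```python
import math

# Pareto ranks and unscaled shares (share/u) of individuals 1..15 from Table 2.2.
ranks = [0, 0, 0, 0, 1, 2, 1, 4, 4, 2, 2, 1, 3, 9, 13]
share_u = [0.71, 0.239, 0.201, 0.022, 0.622, 0.906, 0.531, 0.314,
           0.719, 0.592, 0.363, 0.346, 0.217, 0.094, 0.025]


def variety_preserving_fitness(ranks, shares):
    """Scale the shares into [0, 1] and combine them with the ranks."""
    lo, hi = min(shares), max(shares)
    scale = 1.0 / (hi - lo) if hi > lo else 1.0
    root = math.sqrt(max(ranks))
    fitness = []
    for rank, share in zip(ranks, shares):
        scaled = (share - lo) * scale
        # Non-prevailed individuals keep a fitness below one; all others
        # are pushed back by at most the square root of the worst rank.
        fitness.append(scaled if rank == 0 else rank + root * scaled)
    return fitness


v = variety_preserving_fitness(ranks, share_u)
```

Because the table entries are rounded to three digits, the recomputed values match the last column of Table 2.2 only approximately.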



The last column of Table 2.2 lists these results. All non-prevailed individuals have 
retained a fitness value less than one, lower than those of any other solution candidate in 
the population. However, amongst these best individuals, solution candidate 4 is strongly 
preferred, since it is located in a very remote location of the objective space. Individual 
1 is the least interesting non-dominated one, because it has the densest neighborhood in 
Figure 2.4. In this neighborhood, the individuals 5 and 6 with the Pareto ranks 1 and 2 are 
located. They are strongly penalized by the sharing process and receive the fitness values 
v(5) = 3.446 and v(6) = 5.606. In other words, individual 5 becomes less interesting than 
solution candidate 7, which has the same Pareto rank but a less crowded neighborhood. 
Individual 6 is now even worse than individual 8, although its fitness would be better by two 
if strict Pareto ranking was applied. 

Based on these fitness values, algorithms like tournament selection (see Section 2.4.4) or 
fitness proportionate approaches (discussed in Section 2.4.3) will pick elements in a way that 
preserves the pressure into the direction of the Pareto frontier but also leads to a balanced 
and sustainable variety in the population. The benefits of this approach have been shown, 
for instance, in [1650, 2188]. 

2.3.6 Tournament Fitness Assignment 

In tournament fitness assignment, which is a generalization of the q-level binary tournament 
selection introduced by Weicker [2167], the fitness of each individual is computed by letting 
it compete q times against r other individuals (with r = 1 as default) and counting the 
number of competitions it loses. For a better understanding of the tournament metaphor, 
see Section 2.4.4 on page 127, where the tournament selection scheme is discussed. The 
number of losses will approximate its Pareto rank, but is a bit more randomized than 
that. If we counted the number of tournaments won instead of the losses, we would 
encounter the same problems as in the first idea of Pareto ranking. 



TODO add remaining fitness assignment methods 

Algorithm 2.5: v ← assignFitnessTournament_{q,r}(Pop, cmp_F) 

Input: q: the number of tournaments per individual 
Input: r: the number of other contestants per tournament, normally 1 
Input: Pop: the population to assign fitness values to 
Input: cmp_F: the comparator function providing the prevalence relation 
Data: i, j, k, z: counter variables 
Data: b: a Boolean variable being true as long as a tournament isn't lost 
Data: p: the individual currently examined 
Output: v: the fitness function 

begin 
    for i ← len(Pop) - 1 down to 0 do 
        z ← q 
        p ← Pop[i] 
        for j ← q down to 1 do 
            b ← true 
            k ← r 
            while (k > 0) ∧ b do 
                b ← ¬(Pop[⌊random_u(0, len(Pop))⌋].x prevails p.x) 
                k ← k - 1 
            if b then z ← z - 1 
        v(p.x) ← z 
    return v 
end 
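
A minimal Python sketch of tournament fitness assignment is given below. It assumes that individuals are represented directly by their objective vectors and that the prevalence relation is plain Pareto domination under minimization; both the helper `dominates` and the function name `assign_fitness_tournament` are hypothetical and only serve as an illustration of Algorithm 2.5.

```python
import random


def dominates(a, b):
    """Pareto domination for minimization: a prevails b."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def assign_fitness_tournament(pop, q, r=1, prevails=dominates, rng=random):
    """Return, for each individual, the number of lost tournaments out of q."""
    fitness = []
    for p in pop:
        losses = 0
        for _ in range(q):
            # A tournament is lost as soon as one of the r randomly
            # drawn opponents prevails the individual p.
            for _ in range(r):
                opponent = pop[rng.randrange(len(pop))]
                if prevails(opponent, p):
                    losses += 1
                    break
        fitness.append(losses)
    return fitness


pop = [(1, 1), (2, 2), (3, 3)]
fit = assign_fitness_tournament(pop, q=200, rng=random.Random(1))
```

In this small population, the non-prevailed candidate (1, 1) can never lose and therefore receives fitness 0, while the other two lose more often the more candidates prevail them.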



2.4 Selection 



2.4.1 Introduction 

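Definition 2.8 (Selection). In evolutionary algorithms, the selection²⁰ operation Mate = 
select(Pop, v, ms) chooses ms individuals according to their fitness values v from the 
population Pop and places them into the mating pool Mate [99, 1242, 232, 1431]. 

$$
\begin{aligned}
\mathit{Mate} = \mathrm{select}(\mathit{Pop}, v, ms) \Rightarrow\; & \forall p \in \mathit{Mate} \Rightarrow p \in \mathit{Pop} \\
& \forall p \in \mathit{Pop} \Rightarrow p \in \mathbb{G} \times \mathbb{X} \\
& v(p.x) \in \mathbb{R}^{+} \;\; \forall p \in \mathit{Pop} \\
& (\mathrm{len}(\mathit{Mate}) \geq \min\{\mathrm{len}(\mathit{Pop}), ms\}) \land (\mathrm{len}(\mathit{Mate}) \leq ms)
\end{aligned}
\qquad (2.12)
$$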

On the mating pool, the reproduction operations discussed in Section 2.5 on page 137 
will subsequently be applied. Selection may behave in a deterministic or in a randomized 
manner, depending on the algorithm chosen and its application-dependent implementation. 
Furthermore, elitist evolutionary algorithms may incorporate an archive Arc in the selection 
process, as sketched in Algorithm 2.2. 

Generally, there are two classes of selection algorithms: those with replacement 
(annotated with a subscript r) and those without replacement (annotated with a subscript w, see 
Equation 2.13) [1809]. In a selection algorithm without replacement, each individual from 
the population Pop is taken into consideration for reproduction at most once and therefore 
also will occur in the mating pool Mate one time at most. The mating pool returned by 
algorithms with replacement can contain the same individual multiple times. Like in nature, 
one individual may thus have multiple offspring. Normally, selection algorithms are used in 
a variant with replacement. One of the reasons for this is the number of elements to be 
placed into the mating pool (corresponding to the parameter ms). If len(Pop) < ms, the 
mating pool returned by a method without replacement contains fewer than ms individuals 
since it can at most consist of the whole population. 

$$\mathit{Mate} = \mathrm{select}_w(\mathit{Pop}, v, ms) \Rightarrow \mathrm{countOccurences}(p, \mathit{Mate}) = 1 \;\; \forall p \in \mathit{Mate} \qquad (2.13)$$

The selection algorithms have a major impact on the performance of evolutionary 
algorithms. Their behavior has thus been subject to several detailed studies, conducted by, for 
instance, Goldberg and Deb [823], Blickle and Thiele [232], and Zhong et al. [2318], just to 
name a few. 

Usually, fitness assignment processes are carried out before selection and the selection 
algorithms base their decisions solely on the fitness v of the individuals. It is possible to rely 
on the prevalence relation, i.e., to write select(Pop, cmp_F, ms) instead of select(Pop, v, ms), 
thus saving the costs of the fitness assignment process. However, this will lead to the same 
problems that occurred in the first approach of prevalence-proportional fitness assignment 
(see Section 2.3.3 on page 112) and we will therefore not discuss such techniques in this 
book. 

Many selection algorithms only work with scalar fitness and thus need to rely on a fitness 
assignment process in multi-objective optimization. Selection algorithms can be chained - 
the resulting mating pool of the first selection may then be used as input for the next one, 
maybe even with a secondary fitness assignment process in between. In some applications, 
an environmental selection that reduces the number of individuals is performed first and 
then a mating selection follows which extracts the individuals which should be used for 
reproduction. 

Visualization 

In the following sections, we will discuss multiple selection algorithms. To make them 
easier to understand, we will visualize the expected number of times S(p) that an individual 
p will reach the mating pool Mate for some of the algorithms. 

$$S(p) = E[\mathrm{countOccurences}(p, \mathit{Mate})] \qquad (2.14)$$

For this purpose, we will use the special case where we have a population Pop of 
len(Pop) = 1000 individuals, p_0..p_999, and also a target mating pool size ms = 1000. Each 
individual p_i has the fitness value v(p_i.x), and fitness is subject to minimization. For this 
fitness, we consider two cases: 

1. As sketched in Fig. 2.6.a, the individual p_i has fitness i, i.e., v_1(p_0.x) = 0, 
v_1(p_1.x) = 1, ..., v_1(p_999.x) = 999. 

2. Individual p_i has fitness (i + 1)³, i.e., v_2(p_0.x) = 1, v_2(p_1.x) = 8, ..., 
v_2(p_999.x) = 1 000 000 000, as illustrated in Fig. 2.6.b. 

2.4.2 Truncation Selection 

Truncation selection²¹, also called deterministic selection or threshold selection, returns the 
k ≤ ms best elements from the list Pop. These elements are copied as often as needed 
until the mating pool size ms is reached. For k, normally values like len(Pop)/2 or len(Pop)/3 
are used. Algorithm 2.6 realizes this scheme by first sorting the population in ascending order 
according to the fitness v. Then, it iterates from 0 to ms - 1 and inserts only the elements 
with indices from 0 to k - 1 into the mating pool. 

Figure 2.6: The two example fitness cases. 



Algorithm 2.6: Mate ← truncationSelect_k(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: k: cut-off value 
Data: i: counter variable 
Output: Mate: the selected individuals which now form the mating pool 

begin 
    Mate ← () 
    k ← min{k, len(Pop)} 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to ms - 1 do 
        Mate ← addListItem(Mate, Pop[i mod k]) 
    return Mate 
end 
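
As a rough illustration, Algorithm 2.6 can be written in a few lines of Python. The function name `truncation_select` and the use of a key function for the fitness are assumptions made for this sketch; fitness is subject to minimization, as elsewhere in this section.

```python
def truncation_select(pop, fitness, ms, k):
    """Fill a mating pool of size ms by cycling through the k best individuals."""
    k = min(k, len(pop))
    ranked = sorted(pop, key=fitness)          # ascending: best individual first
    return [ranked[i % k] for i in range(ms)]


# Ten individuals whose fitness equals their value, cut-off k = 5.
mate = truncation_select(list(range(10)), lambda x: x, ms=10, k=5)
```

With k = len(Pop)/2 and ms = len(Pop), each individual of the better half enters the mating pool exactly twice and the worse half is discarded.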



Truncation selection is usually used in Evolution Strategies with (μ+λ) and (μ, λ) 
strategies. In general evolutionary algorithms, it should be combined with a fitness assignment 
process that incorporates diversity information in order to prevent premature convergence. 
Recently, Lässig et al. [1260] have proved that truncation selection is the optimal selection 
strategy for crossover, provided that the right value of k is used. In practical applications, 
this value is normally not known. 

In Figure 2.7, we sketch the expected number of offspring for the individuals from our 
examples specified in Section 2.4.1. In this selection scheme, the diagram will look exactly 
the same regardless of whether we use fitness configuration 1 or 2, since it is solely based 
on the order of individuals and not on the numerical relation of their fitness. If we set 
k = ms = len(Pop), each individual will have one offspring on average. If k = ms/2, the 
top 50% of the individuals will have two offspring and the others none. For k = ms/10, only 
the best 100 of the 1000 solution candidates will reach the mating pool but reproduce 10 
times on average. 

Figure 2.7: The number of expected offspring in truncation selection. 



2.4.3 Fitness Proportionate Selection 

Fitness proportionate selection²² has already been applied in the original genetic algorithms 
as introduced by Holland [940] and therefore is one of the oldest selection schemes. In fitness 
proportionate selection, the probability P(p_i) of an individual p_i ∈ Pop to enter the mating 
pool is proportional to its fitness v(p_i.x) (subject to maximization) compared to the sum of 
the fitness of all individuals. This relation in its original form is defined in Equation 2.15 
below. 

$$P(p_i) = \frac{v(p_i.x)}{\sum_{\forall p_j \in \mathit{Pop}} v(p_j.x)} \qquad (2.15)$$

There exists a variety of approaches which realize such probability distributions [823], 
like stochastic remainder selection (Brindle [289], Booker [248]) and stochastic universal 
selection (Baker [121], Grefenstette and Baker [858]). The most commonly known method is 
the Monte Carlo roulette wheel selection by De Jong [512], where we imagine the individuals 
of a population to be placed on a roulette²³ wheel as sketched in Fig. 2.8.a. The size of 
the area on the wheel standing for a solution candidate is proportional to its fitness. The 
wheel is spun, and the individual where it stops is placed into the mating pool Mate. This 
procedure is repeated until ms individuals have been selected. 

In the context of this book, fitness is subject to minimization. Here, higher fitness values 
v(p.x) indicate unfit solution candidates p.x whereas lower fitness denotes high utility. 
Furthermore, the fitness values are normalized into a range of [0, sum], because otherwise, 
fitness proportionate selection will handle the set of fitness values {0, 1, 2} in a different 
way than {10, 11, 12}. Equation 2.19 defines the framework for such a (normalized) fitness 
proportionate selection "rouletteWheelSelect". It is illustrated in Fig. 2.8.b and realized 
in Algorithm 2.7 as a variant with and in Algorithm 2.8 without replacement. Amongst 
others, Whitley [2211] points out that even fitness normalization as performed here cannot 
overcome the drawbacks of fitness proportional selection methods. 

Fig. 2.8.a: Example for fitness maximization. Fig. 2.8.b: Example for normalized fitness minimization. 

Figure 2.8: Examples for the idea of roulette wheel selection. 

²² http://en.wikipedia.org/wiki/Fitness_proportionate_selection [accessed 2008-03-19] 

²³ http://en.wikipedia.org/wiki/Roulette [accessed 2008-03-20] 



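$$\mathit{minV} = \min\{v(p.x) \;\; \forall p \in \mathit{Pop}\} \qquad (2.16)$$

$$\mathit{maxV} = \max\{v(p.x) \;\; \forall p \in \mathit{Pop}\} \qquad (2.17)$$

$$\mathit{normV}(p.x) = \frac{\mathit{maxV} - v(p.x)}{\mathit{maxV} - \mathit{minV}} \qquad (2.18)$$

$$P(p_i) = \frac{\mathit{normV}(p_i.x)}{\sum_{\forall p_j \in \mathit{Pop}} \mathit{normV}(p_j.x)} \qquad (2.19)$$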



But what are the drawbacks of fitness proportionate selection methods? Let us 
visualize the expected results of roulette wheel selection applied to the special cases stated in 
Section 2.4.1. Figure 2.9 illustrates the number of expected occurrences S(p_i) of an individual 
p_i if roulette wheel selection is applied. Since ms = 1000, we draw a single individual 
from the population Pop one thousand times. Each single choice is based on the proportion 
of the individual fitness in the total fitness of all individuals, as defined in Equation 2.15 
and Equation 2.19. Thus, in scenario 1 with the fitness sum (999 · 1000)/2 = 499500, the relation 
S(p_i) = ms · i/499500 holds for fitness maximization and S(p_i) = ms · (999 - i)/499500 for 
minimization. As a result (sketched in Fig. 2.9.a), the fittest individuals produce (on average) 
two offspring, whereas the worst solution candidates will always vanish in this example. For 
the second scenario with v_2(p_i.x) = (i + 1)³, the total fitness sum is approximately 
2.51 · 10¹¹ and S(p_i) = ms · (i + 1)³/(2.51 · 10¹¹) holds for maximization. The resulting 
expected values depicted in Fig. 2.9.b are significantly different from those in Fig. 2.9.a. 
This means that the design of the objective functions (or the fitness assignment process) has 
a very strong influence on the convergence behavior of the evolutionary algorithm. This 
selection method only works well if the fitness of an individual is indeed something like a 
proportional measure for the probability that it will produce better offspring. 
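
The dependence on the numerical fitness values can be checked with a small Python sketch that evaluates Equation 2.15 (maximization) or Equations 2.18 and 2.19 (normalized minimization) for the two example cases. The function name `expected_offspring` is hypothetical, and the sketch assumes that not all fitness values are equal.

```python
def expected_offspring(fitness, ms, minimize=True):
    """Expected number of copies of each individual in the mating pool."""
    if minimize:
        lo, hi = min(fitness), max(fitness)
        # normV from Equation 2.18: 1 for the best, 0 for the worst individual.
        weights = [(hi - v) / (hi - lo) for v in fitness]
    else:
        weights = list(fitness)
    total = sum(weights)
    return [ms * w / total for w in weights]


v1 = [i for i in range(1000)]               # fitness case 1: v1(p_i.x) = i
v2 = [(i + 1) ** 3 for i in range(1000)]    # fitness case 2: v2(p_i.x) = (i + 1)^3

s1 = expected_offspring(v1, ms=1000)
s2 = expected_offspring(v2, ms=1000)
```

Although both cases order the individuals identically, `s1` and `s2` differ: in case 1 the best individual is expected to have two offspring under minimization, whereas in case 2 its expected share is noticeably smaller.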

Thus, roulette wheel selection performs poorly compared to other schemes like 
tournament selection [823, 231] or ranking selection [823, 232]. It is mainly included here for 
the sake of completeness and because it is easy to understand and suitable for educational 
purposes. 

Algorithm 2.7: Mate ← rouletteWheelSelect_r(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Data: i: a counter variable 
Data: a: a temporary store for a numerical value 
Data: A: the array of fitness values 
Data: min, max, sum: the minimum, maximum, and sum of the fitness values 
Output: Mate: the mating pool 

begin 
    A ← createList(len(Pop), 0) 
    min ← ∞ 
    max ← -∞ 
    for i ← 0 up to len(Pop) - 1 do 
        a ← v(Pop[i].x) 
        A[i] ← a 
        if a < min then min ← a 
        if a > max then max ← a 
    if max = min then 
        max ← max + 1 
        min ← min - 1 
    sum ← 0 
    for i ← 0 up to len(Pop) - 1 do 
        sum ← sum + (max - A[i]) / (max - min) 
        A[i] ← sum 
    for i ← 0 up to ms - 1 do 
        a ← searchItem_as(random_u(0, sum), A) 
        if a < 0 then a ← -a - 1 
        Mate ← addListItem(Mate, Pop[a]) 
    return Mate 
end 
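
The following Python sketch mirrors the structure of Algorithm 2.7: it normalizes the fitness values, builds the array of cumulative sums, and then locates uniformly drawn random numbers in that array by binary search. The function name `roulette_wheel_select` is hypothetical, and `bisect.bisect_left` merely stands in for the search routine of the pseudocode.

```python
import bisect
import random


def roulette_wheel_select(pop, fitness, ms, rng=random):
    """Normalized fitness proportionate selection with replacement (minimization)."""
    values = [fitness(p) for p in pop]
    lo, hi = min(values), max(values)
    if hi == lo:                      # all individuals equally fit
        hi, lo = hi + 1, lo - 1
    cumulative, total = [], 0.0
    for v in values:
        total += (hi - v) / (hi - lo)
        cumulative.append(total)
    mate = []
    for _ in range(ms):
        r = rng.uniform(0, total)
        # First index whose cumulative sum is not smaller than r.
        index = min(bisect.bisect_left(cumulative, r), len(pop) - 1)
        mate.append(pop[index])
    return mate


mate = roulette_wheel_select(list(range(5)), lambda x: x, ms=1000,
                             rng=random.Random(42))
```

Since the worst individual has a normalized fitness of zero, it owns no area on the wheel and does not appear in the mating pool.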




Figure 2.9: The number of expected offspring in roulette wheel selection. 

Algorithm 2.8: Mate ← rouletteWheelSelect_w(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Data: i: a counter variable 
Data: a, b: temporary stores for numerical values 
Data: A: the array of fitness values 
Data: min, max, sum: the minimum, maximum, and sum of the fitness values 
Output: Mate: the mating pool 

begin 
    A ← createList(len(Pop), 0) 
    min ← ∞ 
    max ← -∞ 
    for i ← 0 up to len(Pop) - 1 do 
        a ← v(Pop[i].x) 
        A[i] ← a 
        if a < min then min ← a 
        if a > max then max ← a 
    if max = min then 
        max ← max + 1 
        min ← min - 1 
    sum ← 0 
    for i ← 0 up to len(Pop) - 1 do 
        sum ← sum + (max - A[i]) / (max - min) 
        A[i] ← sum 
    for i ← 0 up to min{ms, len(Pop)} - 1 do 
        a ← searchItem_as(random_u(0, sum), A) 
        if a < 0 then a ← -a - 1 
        if a = 0 then b ← 0 
        else b ← A[a - 1] 
        b ← A[a] - b 
        for j ← a + 1 up to len(A) - 1 do 
            A[j] ← A[j] - b 
        sum ← sum - b 
        Mate ← addListItem(Mate, Pop[a]) 
        Pop ← deleteListItem(Pop, a) 
        A ← deleteListItem(A, a) 
    return Mate 
end 



2.4.4 Tournament Selection 

Tournament selection²⁴, proposed by Wetzel [2198] and studied by Brindle [289], is one 
of the most popular and effective selection schemes. Its features are well-known and have 
been analyzed by a variety of researchers such as Blickle and Thiele [231, 232], Miller and 
Goldberg [1416], Lee et al. [1269], Sastry and Goldberg [1809], and Oei et al. [1558]. In 
tournament selection, k elements are picked from the population Pop and compared with 
each other in a tournament. The winner of this competition will then enter the mating pool 
Mate. Although it is a simple selection strategy, it is very powerful and therefore used in 
many practical applications [55, 316, 1403, 46]. 

As an example, consider a tournament selection (with replacement) with a tournament size 
of two [2208]. For each single tournament, the contestants are chosen randomly according to 
a uniform distribution and the winners will be allowed to enter the mating pool. If we assume 
that the mating pool will contain about as many individuals as the population, each 
individual will, on average, participate in two tournaments. The best solution candidate of 
the population will win all the contests it takes part in and thus, again on average, contributes 
approximately two copies to the mating pool. The median individual of the population is 
better than 50% of its challengers but will also lose against 50%. Therefore, it will enter the 
mating pool roughly one time on average. The worst individual in the population will lose 
all its challenges to other solution candidates and can only score if competing against 
itself, which will happen with probability (1/ms)². It will not be able to reproduce in the 
average case because ms · (1/ms)² = 1/ms < 1 ∀ms > 1. 

For visualization purposes, let us go back to our examples from Section 2.4.1 with a 
population of 1000 individuals p_0..p_999 and ms = 1000. Again, we assume that each 
individual has a unique fitness value of v_1(p_i.x) = i or v_2(p_i.x) = (i + 1)³, respectively. 
If we apply tournament selection with replacement in this special scenario, the expected number 
of occurrences S(p_i) of an individual p_i in the mating pool can be computed according to 
Blickle and Thiele [232] as 

$$S(p_i) = ms \cdot \left( \left(\frac{1000 - i}{1000}\right)^{k} - \left(\frac{1000 - i - 1}{1000}\right)^{k} \right) \qquad (2.20)$$



Figure 2.10: The number of expected offspring in tournament selection. 



The absolute values of the fitness play no role. The only thing that matters is whether 
or not the fitness of one individual is higher than the fitness of another one, not the fitness 
difference itself. The expected numbers of offspring for the two example cases 1 and 2 from 
Section 2.4.1 are the same. Tournament selection thus gets rid of the problems of fitness 
proportionate methods. Figure 2.10 depicts these numbers for different tournament sizes 
k ∈ {1, 2, 3, 4, 5, 10}. If k = 1, tournament selection degenerates to randomly picking 
individuals and each solution candidate will occur one time in the mating pool on average. With 
rising k, the selection pressure increases: individuals with good fitness values create more 
and more offspring whereas the chance of worse solution candidates to reproduce decreases. 
Tournament selection with replacement (TSR) is presented in Algorithm 2.9. Tournament 
selection without replacement (TSoR) [1269, 18] can be defined in two forms. In the first 
variant, specified as Algorithm 2.11, a solution candidate cannot compete against itself. This 
method is defined in. In Algorithm 2.10, on the other hand, an individual may enter the 
mating pool at most once. 



Algorithm 2.9: Mate ← tournamentSelect_{r,k}(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: [implicit] k: the tournament size 
Data: a: the index of the tournament winner 
Data: i, j: counter variables 
Output: Mate: the winners of the tournaments which now form the mating pool 

begin 
    Mate ← () 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to ms - 1 do 
        a ← ⌊random_u(0, len(Pop))⌋ 
        for j ← 1 up to k - 1 do 
            a ← min{a, ⌊random_u(0, len(Pop))⌋} 
        Mate ← addListItem(Mate, Pop[a]) 
    return Mate 
end 
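
Algorithm 2.9 relies on a small trick: once the population is sorted in ascending order of fitness, the winner of a tournament is simply the contestant with the smallest index. A minimal Python sketch of this idea, with the hypothetical function name `tournament_select`, could look as follows.

```python
import random


def tournament_select(pop, fitness, ms, k, rng=random):
    """Deterministic tournament selection with replacement (minimization)."""
    ranked = sorted(pop, key=fitness)      # best individual at index 0
    mate = []
    for _ in range(ms):
        # Draw k contestants uniformly; the smallest index is the fittest.
        winner = min(rng.randrange(len(ranked)) for _ in range(k))
        mate.append(ranked[winner])
    return mate


mate = tournament_select(list(range(100)), lambda x: x, ms=2000, k=5,
                         rng=random.Random(0))
```

Only the order of the fitness values enters the computation, so replacing the fitness by any strictly increasing transformation of it leaves the distribution of the mating pool unchanged.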



Algorithm 2.10: Mate ← tournamentSelect_{w,k}(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: [implicit] k: the tournament size 
Data: a: the index of the tournament winner 
Data: i, j: counter variables 
Output: Mate: the winners of the tournaments which now form the mating pool 

begin 
    Mate ← () 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to min{len(Pop), ms} - 1 do 
        a ← ⌊random_u(0, len(Pop))⌋ 
        for j ← 1 up to min{len(Pop), k} - 1 do 
            a ← min{a, ⌊random_u(0, len(Pop))⌋} 
        Mate ← addListItem(Mate, Pop[a]) 
        Pop ← deleteListItem(Pop, a) 
    return Mate 
end 



Algorithm 2.11: Mate ← tournamentSelect_{w,k}(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: [implicit] k: the tournament size 
Data: A: the list of contestants per tournament 
Data: a: the tournament winner 
Data: i, j: counter variables 
Output: Mate: the winners of the tournaments which now form the mating pool 

begin 
    Mate ← () 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to ms - 1 do 
        A ← () 
        for j ← 1 up to min{k, len(Pop)} do 
            repeat 
                a ← ⌊random_u(0, len(Pop))⌋ 
            until searchItem_u(a, A) < 0 
            A ← addListItem(A, a) 
        a ← min A 
        Mate ← addListItem(Mate, Pop[a]) 
    return Mate 
end 

The algorithms specified here should more precisely be called deterministic tournament 
selection algorithms since the winner of the k contestants that take part in each 
tournament always enters the mating pool. In the non-deterministic variant, this is not 
necessarily the case. There, a probability p is defined. The best individual in the tournament 
is selected with probability p, the second best with probability p(1 - p), the third best with 
probability p(1 - p)², and so on. The i-th best individual in a tournament enters the mating 
pool with probability p(1 - p)^(i-1). Algorithm 2.12 realizes this behavior for a tournament 
selection with replacement. Notice that it becomes equivalent to Algorithm 2.9 if p is set 
to 1. Besides the algorithms discussed here, a set of additional tournament-based selection 
methods has been introduced by Lee et al. [1269]. 

Algorithm 2.12: Mate ← tournamentSelect^p_{r,k}(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: [implicit] p: the selection probability, p ∈ [0, 1] 
Input: [implicit] k: the tournament size 
Data: A: the set of tournament contestants 
Data: i, j: counter variables 
Output: Mate: the winners of the tournaments which now form the mating pool 

begin 
    Mate ← () 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to ms - 1 do 
        A ← () 
        for j ← 0 up to k - 1 do 
            A ← addListItem(A, ⌊random_u(0, len(Pop))⌋) 
        A ← sortList_a(A, cmp(a1, a2) ≡ (a1 - a2)) 
        for j ← 0 up to len(A) - 1 do 
            if (random_u() < p) ∨ (j ≥ len(A) - 1) then 
                Mate ← addListItem(Mate, Pop[A[j]]) 
                j ← ∞ 
    return Mate 
end 
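
The non-deterministic variant can be sketched in Python as shown below. The function name `tournament_select_nd` is hypothetical; as in the pseudocode, the last contestant of a tournament is accepted unconditionally so that every tournament yields exactly one winner.

```python
import random


def tournament_select_nd(pop, fitness, ms, k, p, rng=random):
    """Non-deterministic tournament selection with replacement (minimization)."""
    ranked = sorted(pop, key=fitness)      # best individual at index 0
    mate = []
    for _ in range(ms):
        contestants = sorted(rng.randrange(len(ranked)) for _ in range(k))
        chosen = contestants[-1]           # fallback: the worst contestant
        for index in contestants[:-1]:
            # Each better contestant is accepted with probability p.
            if rng.random() < p:
                chosen = index
                break
        mate.append(ranked[chosen])
    return mate


pop = list(range(100))
strict = tournament_select_nd(pop, lambda x: x, 2000, 5, p=1.0, rng=random.Random(0))
inverse = tournament_select_nd(pop, lambda x: x, 2000, 5, p=0.0, rng=random.Random(0))
```

With p = 1 the best contestant always wins, which reproduces the deterministic scheme; with p = 0 the worst contestant of each tournament is chosen, so p acts as a continuous control of the selection pressure.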



2.4.5 Ordered Selection 

Ordered selection is another approach for circumventing the problems of fitness 
proportionate selection methods. Here, the probability of an individual to be selected is 
proportional to (a power of) its position (rank) in the sorted list of all individuals in the 
population. The implicit parameter k ∈ ℝ⁺ of the ordered selection algorithm determines the 
selection pressure. It equals the number of expected offspring of the best individual and is 
thus very similar to the parameter k of tournament selection. The bigger k gets, the higher is 
the probability that individuals which are non-prevailed, i.e., have good objective values, 
will be selected. 

Algorithm 2.13 demonstrates how ordered selection with replacement works and the 
variant without replacement is described in Algorithm 2.14. Basically, it first converts the 
parameter k to a power q to which the uniformly drawn random numbers are raised that 
are used for indexing the sorted individual list. This can be achieved with Equation 2.21. 

$$q = \frac{1}{1 - \frac{\log k}{\log ms}} \qquad (2.21)$$

Figure 2.11 illustrates the expected offspring in the application of ordered selection with 
k ∈ {1, 2, 3, 4, 5}. As in tournament selection, a value of k = 1 degenerates the 
evolutionary algorithm to a parallel random walk. Another close similarity to tournament 
selection occurs when comparing the exact formulas computing the expected offspring for our 
examples: 

$$S(p_i) = ms \cdot \left( \left(\frac{i + 1}{1000}\right)^{\frac{1}{q}} - \left(\frac{i}{1000}\right)^{\frac{1}{q}} \right) \qquad (2.22)$$

Equation 2.22 looks pretty much like Equation 2.20. The differences between the two 
selection methods become obvious when comparing the diagrams Figure 2.11 and Figure 2.10, 
which both are independent of the actual fitness values. Tournament selection creates many 
copies of the better fraction of the population and almost none of the others. Ordered 
selection focuses on an even smaller group of the fittest individuals, but even the worst 
solution candidates still have a survival probability not too far from one. In other words, 
while tournament selection reproduces a larger group of good individuals and kills most of 
the others, ordered selection assigns very high fertility to very few individuals but also 
preserves the less fit ones. 

Algorithm 2.13: Mate ← orderedSelect_r(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: [implicit] k: the parameter of the ordering selection 
Data: q: the power value to be used for ordering 
Data: i: a counter variable 
Output: Mate: the mating pool 

begin 
    q ← 1 / (1 - log k / log ms) 
    Mate ← () 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to ms - 1 do 
        Mate ← addListItem(Mate, Pop[⌊random_u()^q · len(Pop)⌋]) 
    return Mate 
end 

Algorithm 2.14: Mate ← orderedSelect_w(Pop, v, ms) 

Input: Pop: the list of individuals to select from 
Input: v: the fitness values 
Input: ms: the number of individuals to be placed into the mating pool Mate 
Input: [implicit] k: the parameter of the ordering selection 
Data: q: the power value to be used for ordering 
Data: i, j: counter variables 
Output: Mate: the mating pool 

begin 
    q ← 1 / (1 - log k / log ms) 
    Mate ← () 
    Pop ← sortList_a(Pop, v) 
    for i ← 0 up to min{ms, len(Pop)} - 1 do 
        j ← ⌊random_u()^q · len(Pop)⌋ 
        Mate ← addListItem(Mate, Pop[j]) 
        Pop ← deleteListItem(Pop, j) 
    return Mate 
end 
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Figure 2.11: The number of expected offspring in ordered selection. 



2.4.6 Ranking Selection 

Ranking selection, introduced by Baker [120] and more thoroughly discussed by Whitley
[2211], Blickle and Thiele [232, 230], and Goldberg and Deb [823], is another approach for
circumventing the problems of fitness proportionate selection methods. In ranking selection
[120, 2211, 858], the probability of an individual to be selected is proportional to its position
(rank) in the sorted list of all individuals in the population. Using the rank smooths out
larger differences of the objective values and emphasizes small ones. Generally, we can regard
the conventional ranking selection method as the application of a fitness assignment process
setting the rank as fitness (which can be achieved with Pareto ranking) and a subsequent
fitness proportional selection.
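The two-step view described above (rank as fitness, followed by a proportional selection) can be illustrated with a short Python sketch. It assumes a single objective subject to minimization and a simple linear weighting in which the best of n individuals receives weight n and the worst weight 1; this weighting and the names `rank_weights` and `ranking_select` are assumptions made for the illustration only.

```python
import random


def rank_weights(n):
    """Weight n for the best-ranked individual down to 1 for the worst (assumed linear scheme)."""
    return [n - position for position in range(n)]


def ranking_select(pop, objective, ms, rng=random):
    """Select ms individuals with probability proportional to their rank weight."""
    ranked = sorted(pop, key=objective)      # best (smallest objective value) first
    weights = rank_weights(len(ranked))      # depends only on the order, not on the values
    return rng.choices(ranked, weights=weights, k=ms)
```

Since the weights depend only on the order, objective values of 1, 2, and 1000 are treated exactly like 1, 2, and 3, which is the smoothing effect mentioned in the text.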

2.4.7 VEGA Selection 

The Vector Evaluated Genetic Algorithm by Schaffer [1821, 1822] applies a special selection
algorithm which does not incorporate any preceding fitness assignment process but works on
the objective values directly. For each of the objective functions f_i ∈ F, it selects a subset of
the mating pool Mate of the size ms/|F|. To do so, it applies fitness proportionate selection
which is based on f_i instead of a fitness assignment "assignFitness". The mating pool is then
a mixture of these sub-selections. Richardson et al. [1728] show in [1820] that this selection
scheme is approximately the same as computing a weighted sum of the fitness values. As
pointed out by Fonseca and Fleming [714], in the general case, this selection method will
sample non-prevailed solution candidates at different frequencies. Schaffer also anticipated
that the population of his GA may split into different species, each particularly strong in
one objective, if the Pareto frontier is concave.

Algorithm 2.15: Mate ← vegaSelect(Pop, F, ms)
  Input: Pop: the list of individuals to select from
  Input: F: the objective functions
  Input: ms: the number of individuals to be placed into the mating pool Mate
  Data: i: a counter variable
  Data: j: the size of the current subset of the mating pool
  Data: A: a temporary mating pool
  Output: Mate: the individuals selected

  begin
    Mate ← ()
    for i ← 1 up to |F| do
      j ← ⌊ms / |F|⌋
      if i = 1 then j ← j + ms mod |F|
      A ← rouletteWheelSelect_r(Pop, v ≡ f_i, j)
      Mate ← appendList(Mate, A)
    return Mate
  end
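A compact Python sketch of Algorithm 2.15 is given here for illustration. The helper `roulette_wheel_select` is a simplified stand-in for the fitness proportionate selection used in the text: it assumes that larger, non-negative objective values are better and that at least one value is positive, which is an assumption of this sketch and not a statement about the original roulette wheel routine.

```python
import random


def roulette_wheel_select(pop, value, count, rng=random):
    """Proportionate selection with replacement (assumes non-negative values, larger is better)."""
    weights = [value(p) for p in pop]
    return rng.choices(pop, weights=weights, k=count)


def vega_select(pop, objectives, ms, rng=random):
    """Build the mating pool from one sub-selection per objective function."""
    mate = []
    for i, f in enumerate(objectives):
        share = ms // len(objectives)
        if i == 0:
            share += ms % len(objectives)  # the first objective also takes the remainder
        mate.extend(roulette_wheel_select(pop, f, share, rng))
    return mate
```

With two objectives and ms = 5, the first objective contributes three individuals and the second one two, so the pool always has exactly ms entries.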



2.4.8 Clearing and Simple Convergence Prevention (SCP) 

In our experiments (especially in Genetic Programming and problems with discrete objective
functions) we often use a very simple mechanism to prevent premature convergence (see
Section 1.4.2) which we outline in Algorithm 2.17. In our opinion, this SCP method is
neither a fitness assignment nor a selection algorithm, but we think it fits best into this section.

The idea is simple: the more similar individuals we have in the population, the more
likely it is that we have converged. We do not know whether we have converged to a global
optimum or to a local one. If we got stuck at a local optimum, we should limit the fraction of
the population which resides at this spot. In case we have found the global optimum, this
approach does not hurt, because in the end, one single point on this optimum suffices.



Clearing 

The first one to apply such an explicit limitation method was Petrowski [1638, 1639] whose
clearing approach is applied in each generation and works as specified in Algorithm 2.16
where fitness is subject to minimization. Basically, clearing divides the population of an EA
into several sub-populations according to a distance measure dist applied in the genotypic
(G) or phenotypic space (X) in each generation. The individuals of each sub-population have
at most the distance σ to the fittest individual in this niche. Then, the fitness of all but the
k best individuals in such a sub-population is set to the worst possible value. This effectively
prevents a niche from getting too crowded. Sareni and Krähenbühl [1801] showed that this
method is very promising. Singh and Deb [1892] suggest a modified clearing approach which
shifts individuals that would be cleared farther away and reevaluates their fitness.
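The clearing procedure can be sketched in Python as follows. The sketch assumes that fitness is subject to minimization, so that infinity is the worst possible value, and it returns pairs of individual and (possibly cleared) fitness instead of modifying a fitness function in place. The parameter name `sigma` for the clearing radius and the return format are choices made for this illustration.

```python
import math


def clearing(pop, fitness, dist, sigma, k):
    """Return (individual, fitness) pairs, best first, after clearing crowded niches."""
    ranked = sorted(((p, fitness(p)) for p in pop), key=lambda pair: pair[1])
    values = [v for _, v in ranked]
    for i in range(len(ranked)):
        if values[i] < math.inf:
            winners = 1  # ranked[i] is the fittest individual of its niche
            for j in range(i + 1, len(ranked)):
                if values[j] < math.inf and dist(ranked[i][0], ranked[j][0]) < sigma:
                    if winners < k:
                        winners += 1
                    else:
                        values[j] = math.inf  # niche is full: clear this individual
    return [(p, v) for (p, _), v in zip(ranked, values)]
```

For example, with the real-valued individuals 0.0, 0.1, 0.2, and 5.0, the fitness x², the absolute difference as distance, radius 1, and capacity k = 2, only the individual 0.2 is cleared, because the niche around 0.0 already holds two winners.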



SCP 

We modified this approach in two respects: We measure similarity not in form of a distance
in G or X, but in the objective space Y ⊆ R^|F|. All individuals are compared with each
other. If two have exactly the same objective values²⁵, one of them is thrown away with

25 The exactly-the-same-criterion makes sense in combinatorial optimization and many Genetic
Programming problems but may easily be replaced with a limit imposed on the Euclidean distance
in real-valued optimization problems, for instance.

Algorithm 2.16: Pop' ← clearing(Pop, σ, k)
  Input: Pop: the list of individuals to apply clearing to
  Input: σ: the clearing radius
  Input: k: the niche capacity
  Input: [implicit] v: the fitness values
  Input: [implicit] dist: a distance measure in the genome or phenome
  Data: n: the current number of winners
  Data: i, j: counter variables
  Output: Pop': the pruned population

  begin
    Pop' ← sortList_a(Pop, v)
    for i ← 0 up to len(Pop') - 1 do
      if v(Pop'[i].x) < ∞ then
        n ← 1
        for j ← i + 1 up to len(Pop') - 1 do
          if (v(Pop'[j].x) < ∞) ∧ (dist(Pop'[i], Pop'[j]) < σ) then
            if n < k then n ← n + 1
            else v(Pop'[j].x) ← ∞
  end



probability²⁶ cp ∈ [0, 1] and does not take part in any further comparisons. This way, we
weed out similar individuals without making any assumptions about G or X and make room
in the population and mating pool for a wider diversity of solution candidates. For cp = 0,
this prevention mechanism is turned off; for cp = 1, all remaining individuals will have
different objective values.

Although this approach is very simple, the results of our experiments were often sig- 
nificantly better with this convergence prevention method turned on than without it 
[1650, 2188]. Additionally, in none of our experiments, the outcomes were influenced nega- 
tively by this filter, which makes it even more robust than other methods for convergence 
prevention like sharing or variety preserving. Algorithm 2.17, which has to be applied after 
the evaluation of the objective values of the individuals in the population and before any 
fitness assignment or selection takes place, specifies how our simple mechanism works. 

If an individual p occurs n times in the population or if there are n individuals with
exactly the same objective values, Algorithm 2.17 cuts down the expected number of their
occurrences S(p) to

$$S(p) = \sum_{i=1}^{n} (1 - cp)^{i-1} = \sum_{i=0}^{n-1} (1 - cp)^{i} = \frac{(1 - cp)^{n} - 1}{-cp} = \frac{1 - (1 - cp)^{n}}{cp} \qquad (2.23)$$

In Figure 2.12, we sketch the expected number of remaining instances of the individual 
p after this pruning process if it occurred n times in the population before Algorithm 2.17 
was applied. 

From Equation 2.23 follows that even a population of infinite size which has fully
converged to one single value will probably not contain more than 1/cp copies of this individual
after the simple convergence prevention has been applied. This threshold is also visible in
Figure 2.12.

$$\lim_{n \to \infty} S(p) = \lim_{n \to \infty} \frac{1 - (1 - cp)^{n}}{cp} = \frac{1 - 0}{cp} = \frac{1}{cp} \qquad (2.24)$$
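Equations 2.23 and 2.24 are easy to evaluate numerically. The small Python helper here is only a sketch; its name `expected_survivors` is hypothetical, and the special case cp = 0 (mechanism turned off, all n copies remain) is handled separately because the closed form would divide by zero.

```python
def expected_survivors(n, cp):
    """Expected number of remaining copies of an individual that occurred n times (Equation 2.23)."""
    if cp == 0.0:
        return float(n)  # prevention turned off: nothing is removed
    return (1.0 - (1.0 - cp) ** n) / cp
```

With cp = 0.5 and n = 3, the sum 1 + 0.5 + 0.25 gives 1.75 expected copies. For very large n and cp = 0.25 the value approaches the limit 1/cp = 4 from Equation 2.24.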

In Petrowski's clearing approach [1638], the maximum number of individuals which can 
survive in a niche was a fixed constant k and, if less than k individuals resided in a niche, 

26 instead of defining a fixed threshold k

Algorithm 2.17: Pop' ← convergencePreventionSCP(Pop, cp)
  Input: Pop: the list of individuals to apply convergence prevention to
  Input: cp: the convergence prevention probability, cp ∈ [0, 1]
  Input: [implicit] F: the set of objective functions
  Data: i, j: counter variables
  Data: p: the individual checked in this generation
  Output: Pop': the pruned population

  begin
    Pop' ← ()
    for i ← 0 up to len(Pop) - 1 do
      p ← Pop[i]
      for j ← len(Pop') - 1 down to 0 do
        if f(p.x) = f(Pop'[j].x) ∀f ∈ F then
          if random_u() < cp then
            Pop' ← deleteListItem(Pop', j)
      Pop' ← addListItem(Pop', p)
    return Pop'
  end




none of them would be affected. In contrast, SCP specifies an expected value of the number of
individuals allowed in a niche via the probability cp, and this number may be both exceeded
and undercut. Another difference between the approaches arises from the space in which the
distance is computed.
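The simple convergence prevention of Algorithm 2.17 translates almost line by line into Python. In this sketch the objective functions are applied to the individuals directly (rather than to a phenotype p.x), and the function name and the `rng` parameter are assumptions introduced for the illustration.

```python
import random


def convergence_prevention_scp(pop, objectives, cp, rng=random):
    """Prune individuals with identical objective values, each with probability cp."""
    pruned = []
    for p in pop:
        values = [f(p) for f in objectives]
        # walk backwards so that deletions do not shift the indices still to be visited
        for j in range(len(pruned) - 1, -1, -1):
            if [f(pruned[j]) for f in objectives] == values:
                if rng.random() < cp:
                    del pruned[j]
        pruned.append(p)
    return pruned
```

For cp = 0 the population is returned unchanged, and for cp = 1 exactly one individual per distinct objective vector survives, which matches the two boundary cases described in the text.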



Discussion 

Whereas clearing prevents the EA from concentrating too much on a certain area in the
search or problem space, SCP stops it from keeping too many individuals with equal utility.
The former approach works against premature convergence to a certain solution structure
while the latter forces the EA to "keep track" of a trail to solution candidates with worse
fitness which may later evolve to good individuals with traits different from the currently
exploited ones.

Which of the two approaches is better has not yet been tested with comparative experiments
and is part of our future work. At the present moment, we assume that in real-valued
search or problem spaces, clearing should be more suitable, whereas we know from
experiments using only our approach that SCP performs very well in combinatorial problems
[1650, 2188] and Genetic Programming (see Section 21.3.2, for instance).

TODO add remaining selection 
algorithms 

2.5 Reproduction 

An optimization algorithm uses the information gathered up to step t for creating the
solution candidates to be evaluated in step t + 1. There exist different methods to do so.
In evolutionary algorithms, the aggregated information corresponds to the population Pop
and the set of best individuals Arc if such an archive is maintained. The search operations
searchOp ∈ Op used in the evolutionary algorithm family are called reproduction
operations, inspired by the biological procreation mechanisms²⁷ of mother nature [1730]. There
are four basic operations:

1. Creation has no direct natural paragon; it simply creates a new genotype without any
ancestors or heritage. Hence, it can roughly be compared with the emergence of the first
living cells out of a soup of certain chemicals²⁸.

2. Duplication resembles cell division²⁹, resulting in two individuals similar to one
parent.

3. Mutation in evolutionary algorithms corresponds to small, random variations in the
genotype of an individual, exactly like its natural counterpart³⁰.

4. Like in sexual reproduction, recombination³¹ combines two parental genotypes to a new
genotype including traits from both elders.

In the following, we will discuss these operations in detail and provide general definitions
for them.

Definition 2.9 (Creation). The creation operation "create" is used to produce a new
genotype g ∈ G with a random configuration.

$$g = \text{create}() \Rightarrow g \in G \qquad (2.25)$$

When an evolutionary algorithm starts, no information about the search space has been
gathered yet. Hence, we cannot use existing solution candidates to derive new ones and
search operations with an arity higher than zero cannot be applied. Creation is thus used
to fill the initial population Pop(t = 0).

Definition 2.10 (Duplication). The duplication operation duplicate : G ↦ G is used to
create an exact copy of an existing genotype g ∈ G.

$$g = \text{duplicate}(g) \quad \forall g \in G \qquad (2.26)$$

27 http://en.wikipedia.org/wiki/Reproduction [accessed 2007-07-03] 

28 http://en.wikipedia.org/wiki/Abiogenesis [accessed 2008-03-17] 

29 http://en.wikipedia.org/wiki/Cell_division [accessed 2008-03-17] 

30 http://en.wikipedia.org/wiki/Mutation [accessed 2007-07-03] 

31 http://en.wikipedia.org/wiki/Sexual_reproduction [accessed 2008-03-17]

Duplication is just a placeholder for copying an element of the search space, i. e., it is
what occurs when neither mutation nor recombination are applied. It is useful to increase
the share of a given type of individual in a population.

Definition 2.11 (Mutation). The mutation operation mutate : G ↦ G is used to create a
new genotype g_n ∈ G by modifying an existing one. The way this modification is performed
is application-dependent. It may happen in a randomized or in a deterministic fashion.

$$g_n = \text{mutate}(g) : g \in G \Rightarrow g_n \in G \qquad (2.27)$$

Definition 2.12 (Recombination). The recombination (or crossover³²) operation
recombine : G × G ↦ G is used to create a new genotype g_n ∈ G by combining the
features of two existing ones. Depending on the application, this modification may happen
in a randomized or in a deterministic fashion.

$$g_n = \text{recombine}(g_a, g_b) : g_a, g_b \in G \Rightarrow g_n \in G \qquad (2.28)$$

Notice that the term recombination is more general than crossover since it stands for
arbitrary search operations that combine the traits of two individuals. Crossover, however,
is only used if the elements of the search space G are linear representations. Then, it stands
for exchanging parts of these so-called strings.

Now we can define the set Op_EA of search operations most commonly applied in
evolutionary algorithms as

$$Op_{EA} = \{\text{create}, \text{duplicate}, \text{mutate}, \text{recombine}\} \qquad (2.29)$$

All of them can be combined arbitrarily. It is, for instance, not unusual to mutate the results
of a recombination operation, i. e., to perform mutate(recombine(g_1, g_2)).
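To make the four operations concrete, the following Python sketch implements them for one particular search space. Everything specific in it is an assumption chosen for illustration: genotypes are fixed-length bit tuples, mutation flips a single randomly chosen bit, and recombination is a single-point crossover. The definitions above allow many other realizations.

```python
import random

GENOME_LENGTH = 8  # hypothetical length of the bit-string genotypes


def create(rng=random):
    """Nullary operation: a new random genotype without ancestors."""
    return tuple(rng.randint(0, 1) for _ in range(GENOME_LENGTH))


def duplicate(g):
    """Unary operation: an exact copy of an existing genotype."""
    return tuple(g)


def mutate(g, rng=random):
    """Unary operation: flip the bit at one randomly chosen locus."""
    locus = rng.randrange(len(g))
    return g[:locus] + (1 - g[locus],) + g[locus + 1:]


def recombine(ga, gb, rng=random):
    """Binary operation: single-point crossover of two parents of equal length."""
    cut = rng.randrange(1, len(ga))
    return ga[:cut] + gb[cut:]
```

The combination mentioned in the text is then simply `mutate(recombine(g1, g2))`, where `g1` and `g2` are two genotypes produced by `create()`.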

The four operators are altogether used to reproduce whole populations of individuals. 

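Definition 2.13 (reproducePop). The population reproduction operation Pop =
reproducePop(Mate) is used to create a new population Pop by applying the reproduction
operations to the mating pool Mate.

$$\begin{aligned}
Pop = \text{reproducePop}(Mate) \Rightarrow{} & \forall p \in Mate \Rightarrow p \in P,\ \forall p \in Pop \Rightarrow p \in P,\ \text{len}(Pop) = \text{len}(Mate) \\
& \forall p \in Pop \Rightarrow\ p.g = \text{create}()\ \lor \\
& \qquad p.g = \text{duplicate}(p_{old}.g) : p_{old} \in Mate\ \lor \\
& \qquad p.g = \text{mutate}(p_{old}.g) : p_{old} \in Mate\ \lor \\
& \qquad p.g = \text{recombine}(p_{old1}.g, p_{old2}.g) : p_{old1}, p_{old2} \in Mate
\end{aligned} \qquad (2.30)$$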



For creating an initial population of the size s, we furthermore define the function 
createPop(s) in Algorithm 2.18. 

2.5.1 NCGA Reproduction 

The Neighborhood Cultivation Genetic Algorithm by Watanabe et al. [2160] discussed in
?? uses a special reproduction method. Recombination is performed only on neighboring
individuals, which leads to child genotypes close to their parents. This so-called
neighborhood cultivation shifts the recombination operator more in the direction of exploitation,
i.e., NCGA uses crossover for investigating the close surrounding of known solution candidates.
The idea is that parents that do not differ much from each other are more likely to be
compatible in order to produce functional offspring than parents that have nothing in common.



32 http://en.wikipedia.org/wiki/Recombination [accessed 2007-07-03]

Algorithm 2.18: Pop ← createPop(s)
  Input: s: the number of individuals in the new population
  Input: [implicit] create: the creation operator
  Data: i: a counter variable
  Output: Pop: the new population of randomly created individuals (len(Pop) = s)

  begin
    Pop ← ()
    for i ← 0 up to s - 1 do
      Pop ← addListItem(Pop, create())
    return Pop
  end



Neighborhood cultivation is achieved in Algorithm 2.19 by sorting the mating pool along one
focused objective. Then, the elements situated directly beside each other are recombined.
The focus on the objective rotates in a way that in a three-objective optimization the first
objective is focused at the beginning, then the second, then the third and after that again
the first. The algorithm shown here receives the additional parameter foc which denotes
the focused objective. Both recombination and mutation are performed with an implicitly
defined probability (r and m, respectively).



Algorithm 2.19: Pop ← ncgaReproducePop_foc(Mate)
  Input: Mate: the mating pool
  Input: foc: the objective currently focused
  Input: [implicit] recombine, mutate: the recombination and mutation routines
  Input: [implicit] r, m: the probabilities of recombination and mutation
  Data: i: a counter variable
  Output: Pop: the new population with len(Pop) = len(Mate)

  begin
    Pop ← sortList_a(Mate, f_foc)
    for i ← 0 up to len(Pop) - 1 do
      if (random_u() < r) ∧ (i < len(Pop) - 1) then Pop[i] ← recombine(Pop[i], Pop[i+1])
      if random_u() < m then Pop[i] ← mutate(Pop[i])
    return Pop
  end
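The neighborhood cultivation of Algorithm 2.19 can be sketched in Python as follows. The sketch passes the focused objective, the two operators, and the probabilities r and m as explicit arguments instead of implicit ones, and the function name is hypothetical; how the focus rotates between calls is left to the caller.

```python
import random


def ncga_reproduce_pop(mate, focused_objective, recombine, mutate, r, m, rng=random):
    """Recombine each element with its right-hand neighbor after sorting along one objective."""
    pop = sorted(mate, key=focused_objective)
    for i in range(len(pop)):
        if rng.random() < r and i < len(pop) - 1:
            pop[i] = recombine(pop[i], pop[i + 1])  # neighbors are similar in the focused objective
        if rng.random() < m:
            pop[i] = mutate(pop[i])
    return pop
```

As a toy usage example, real numbers can serve as individuals with averaging as recombination: for the mating pool 3, 1, 2 with r = 1 and m = 0, the sorted list 1, 2, 3 becomes 1.5, 2.5, 3, since the last element has no right-hand neighbor.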



2.6 Algorithms 

Besides the basic evolutionary algorithms introduced in Section 2.1.3 on page 98, there exists 
a variety of other, more sophisticated approaches. Many of them deal especially with multi- 
objective optimization which imposes new challenges on fitness assignment and selection. In 
this section we discuss the most prominent of these evolutionary algorithms. 



2.6.1 VEGA 

The very first multi-objective genetic algorithm is the Vector Evaluated Genetic Algorithm
(VEGA) created by Schaffer [1821, 1822] in the mid-1980s. The main difference between
VEGA and the basic form of evolutionary algorithms is the modified selection algorithm
which you can find discussed in Section 2.4.7 on page 133. This selection algorithm solely
relies on the objective functions F and does not use any preceding fitness assignment process
nor can it incorporate a prevalence comparison scheme cmp_F. However, it has severe
weaknesses also discussed in Section 2.4.7 and thus cannot be considered as an efficient approach
to multi-objective optimization.



Algorithm 2.20: X* ← vega(F, ps)
  Input: F: the objective functions
  Input: ps: the population size
  Data: t: the generation counter
  Data: Pop: the population
  Data: Mate: the mating pool
  Data: v: the fitness function resulting from the fitness assigning process
  Output: X*: the set of the best elements found

  begin
    t ← 0
    Pop ← createPop(ps)
    while ¬terminationCriterion() do
      Mate ← vegaSelect(Pop, F, ps)
      t ← t + 1
      Pop ← reproducePop(Mate)
    return extractPhenotypes(extractOptimalSet(Pop))
  end



TODO add remaining EAs 



3 



Genetic Algorithms 



3.1 Introduction 

Genetic algorithms¹ (GAs) are a subclass of evolutionary algorithms where the elements
of the search space G are binary strings (G = B*) or arrays of other elementary types. As
sketched in Figure 3.1, the genotypes are used in the reproduction operations whereas the
values of the objective functions f ∈ F are computed on the basis of the phenotypes in the
problem space X which are obtained via the genotype-phenotype mapping "gpm" [821,
940, 916, 2208].

The roots of genetic algorithms go back to the mid-1950s, when biologists like Barricelli
[150, 151, 152, 153] and the computer scientist Fraser [742] began to apply computer-aided
simulations in order to gain more insight into genetic processes and the natural evolution and
selection. Bremermann [287] and Bledsoe [216, 215, 217, 218] used evolutionary approaches
based on binary string genomes for solving inequalities, for function optimization, and for
determining the weights in neural networks in the early 1960s [219]. At the end of that
decade, important research on such search spaces was contributed by Bagley [116] (who
introduced the term genetic algorithm), Rosenberg [1760], Cavicchio, Jr. [354, 355], and
Frantz [741], all based on the ideas of Holland at the University of Michigan. As a result of
Holland's work [937, 939, 940, 938], genetic algorithms as a new approach for problem solving
could be formalized and finally became widely recognized and popular. Today, there are many
applications in science, economy, and research and development [1681] that can be tackled
with genetic algorithms. Therefore, various forms of genetic algorithms [423] have been
developed. Some genetic algorithms², like the human-based genetic algorithms³ (HBGA),
for instance, even require human beings for evaluating or selecting the solution candidates
[1884, 1997, 1998, 1178, 883].

It should further be mentioned that, because of the close relation to biology and since
genetic algorithms were originally applied to single-objective optimization, the objective
functions f here are often referred to as fitness functions. This is a historically grown misnaming
which should not be mixed up with the fitness assignment processes discussed in Section 2.3
on page 111 and the fitness values v used in the context of this book.



1 http://en.wikipedia.org/wiki/Genetic_algorithm [accessed 2007-07-03]

2 http://en.wikipedia.org/wiki/Interactive_genetic_algorithm [accessed 2007-07-03]

3 http://en.wikipedia.org/wiki/HBGA [accessed 2007-07-03]

Figure 3.1: The basic cycle of genetic algorithms.



3.2 General Information 
3.2.1 Areas Of Application 



Some example areas of application of genetic algorithms are:

Application                                 References
Scheduling                                  [1275, 417, 1228, 160, 340, 339, 341]
Chemistry, Chemical Engineering             [475, 2269, 474, 476, 531, 2127, 1075, 1401]
Medicine                                    [319, 1900, 2278, 2117]
Data Mining and Data Analysis               [1424, 1089, 834, 1991, 445]
Geometry and Physics                        [366, 367, 966, 1222, 1223]
Economics and Finance                       [2302]
Networking and Communication                [628, 1861, 1220, 290, 1164, 2324];
                                            see Section 23.2 on page 401
Electrical Engineering and Circuit Design   [1304, 1305, 1306]
Image Processing                            [25]
Combinatorial Optimization                  [1480, 1134, 1754, 2020, 2323, 32]



3.2.2 Conferences, Workshops, etc. 

Some conferences, workshops, and similar events on genetic algorithms are:

EUROGEN: Evolutionary Methods for Design Optimization and Control with Applications
to Industrial Problems
  see Section 2.2.2 on page 106
FOGA: Foundations of Genetic Algorithms
  http://www.sigevo.org/ [accessed 2007-09-01]
  History: 2007: Mexico City, Mexico, see [1960]
           2005: Aizu-Wakamatsu City, Japan, see [2259]
           2002: Torremolinos, Spain, see [519]
           2000: Charlottesville, VA, USA, see [1927]
           1998: Madison, WI, USA, see [139]
           1996: San Diego, CA, USA, see [172]
           1994: Estes Park, Colorado, USA, see [2214]
           1992: Vail, Colorado, USA, see [2209]
           1990: Bloomington Campus, Indiana, USA, see [1924]
FWGA: Finnish Workshop on Genetic Algorithms and Their Applications
NWGA: Nordic Workshop on Genetic Algorithms
  History: 1997: Helsinki, Finland, see [30]
           1996: Vaasa, Finland, see [29]
           1995: Vaasa, Finland, see [28]
           1994: Vaasa, Finland, see [27]
           1992: Espoo, Finland, see [26]
GALESIA: International Conference on Genetic Algorithms in Engineering Systems:
Innovations and Applications
  now part of CEC, see Section 2.2.2 on page 105
  History: 1997: Glasgow, UK, see [990]
           1995: Sheffield, UK, see [2309]
GECCO: Genetic and Evolutionary Computation Conference
  see Section 2.2.2 on page 107
GEM: International Conference on Genetic and Evolutionary Methods
  History: 2008: Las Vegas, Nevada, USA, see [81]
           2007: Las Vegas, Nevada, USA, see [80]
ICGA: International Conference on Genetic Algorithms
  now part of GECCO, see Section 2.2.2 on page 107
  History: 1997: East Lansing, Michigan, USA, see [98]
           1995: Pittsburgh, PA, USA, see [636]
           1993: Urbana-Champaign, IL, USA, see [730]
           1991: San Diego, CA, USA, see [170]
           1989: Fairfax, Virginia, USA, see [1820]
           1987: Cambridge, MA, USA, see [857]
           1985: Pittsburgh, PA, USA, see [856]
ICANNGA: International Conference on Adaptive and Natural Computing Algorithms
  see Section 2.2.2 on page 108
Mendel: International Conference on Soft Computing
  see Section 1.6.2 on page 90



3.2.3 Online Resources 

Some general resources on genetic algorithms that are available online are:

http://www.obitko.com/tutorials/genetic-algorithms/ [accessed 2008-05-17]
  Last update: 1998
  Description: A very thorough introduction to genetic algorithms by Marek Obitko
http://www.aaai.org/AITopics/html/genalg.html [accessed 2008-05-17]
  Last update: up-to-date
  Description: The genetic algorithms and Genetic Programming pages of the AAAI
http://www.illigal.uiuc.edu/web/ [accessed 2008-05-17]
  Last update: up-to-date
  Description: The Illinois Genetic Algorithms Laboratory (IlliGAL)
http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/genetic/top.html [accessed 2008-05-17]
  Last update: 1997-08-10
  Description: The Genetic Algorithms FAQ.
http://www.rennard.org/alife/english/gavintrgb.html [accessed 2008-05-17]
  Last update: 2007-07-10
  Description: An introduction to genetic algorithms by Jean-Philippe Rennard.
http://www.optiwater.com/GAsearch/ [accessed 2008-06-08]
  Last update: 2003-11-15
  Description: GA-Search - The Genetic Algorithms Search Engine



3.2.4 Books 

Some books about (or including significant information about) genetic algorithms are: 



Goldberg [821]: Genetic Algorithms in Search, Optimization and Machine Learning 

Mitchell [1431]: An Introduction to Genetic Algorithms 

Davis [495]: Handbook of Genetic Algorithms 

Haupt and Haupt [905]: Practical Genetic Algorithms 

Gen and Cheng [787]: Genetic Algorithms and Engineering Design 

Chambers [368]: Practical Handbook of Genetic Algorithms: Applications 

Chambers [369]: Practical Handbook of Genetic Algorithms: New Frontiers 

Chambers [370]: Practical Handbook of Genetic Algorithms: Complex Coding Systems 

Holland [940]: Adaptation in Natural and Artificial Systems 

Gen and Chen [786]: Genetic Algorithms (Engineering Design and Automation) 

Cantú-Paz [330]: Efficient and Accurate Parallel Genetic Algorithms

Heistermann [915]: Genetische Algorithmen. Theorie und Praxis evolutionärer Optimierung

Schöneburg, Heinzmann, and Feddersen [1831]: Genetische Algorithmen und Evolutionsstrategien

Gwiazda [873]: Crossover for single-objective numerical optimization problems

Schaefer and Telega [1819]: Foundations of Global Genetic Optimization

Karr and Freeman [1093]: Industrial Applications of Genetic Algorithms

Bäck [99]: Evolutionary Algorithms in Theory and Practice: Evolution Strategies,
Evolutionary Programming, Genetic Algorithms

Davis [494]: Genetic Algorithms and Simulated Annealing

Alba and Dorronsoro [33]: Cellular Genetic Algorithms



3.3 Genomes in Genetic Algorithms 

Most of the terminology which we have defined in Section 1.3 and used throughout this
book stems from the GA sector. The search space G of a genetic algorithm, for instance, is
referred to as the genome and its elements are called genotypes. Genotypes in nature encompass
the whole hereditary information of an organism encoded in the DNA⁴. The DNA is a string
of base pairs that encodes the phenotypical characteristics of the creature it belongs to. Like
their natural prototypes, the genomes in genetic algorithms are strings, linear sequences of
certain data types [821, 945, 1431]. Because of the linear structure, these genotypes are also
often called chromosomes. In genetic algorithms, we most often use chromosomes which are
strings of one and the same data type, for example bits or real numbers.

Definition 3.1 (String Chromosome). A string chromosome can either be a fixed-length
tuple (Equation 3.1) or a variable-length list (Equation 3.2).

In the first case, the loci i of the genes g_i are constant and, hence, the tuples may contain
elements of different types G_i.

This is not given in variable-length string genomes. Here, the positions of the genes may
shift when the reproduction operations are applied. Thus, all elements of such genotypes
must have the same type G_T.



String chromosomes are normally bit strings, vectors of integer numbers, or vectors of real
numbers. Genetic algorithms with numeric vector genomes in their natural representation,
i. e., where G = X ⊆ R^n, are called real-encoded [1107]. Today, more sophisticated methods
for evolving good strings (vectors) of (real) numbers exist (such as Evolution Strategies,
Differential Evolution, or Particle Swarm Optimization) than processing them like binary
strings with the standard reproduction operations of GAs.

Bit string genomes are sometimes complemented with the application of Gray coding⁵
during the genotype-phenotype mapping. This is done in an effort to preserve locality (see
Section 1.4.3) and ensure that small changes in the genotype will also lead to small changes in
the phenotypes [349]. Collins and Eaton [430] studied different encodings for GAs and found
that their E-code outperforms both Gray and direct binary coding in function optimization.
Messy genomes (see Section 3.7) were introduced to improve locality by linkage learning.
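The locality argument for Gray coding can be illustrated with the standard reflected binary Gray code. The Python sketch here works on non-negative integers rather than on bit tuples, which is a simplification chosen for brevity; the function names are illustrative.

```python
def binary_to_gray(n):
    """Encode a non-negative integer as its reflected binary Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Decode a Gray-coded integer, as a genotype-phenotype mapping would do."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

In direct binary coding, the neighboring values 7 (0111) and 8 (1000) differ in four bits, so four simultaneous bit flips would be needed to move from one to the other. Their Gray codes 0100 and 1100 differ in a single bit only.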

Genetic algorithms are the original prototype of evolutionary algorithms and therefore
fully adhere to the description given in Section 2.1.2. They provide search operators which
closely copy sexual and asexual reproduction schemes from nature. In such "sexual" search

4 You can find an illustration of the DNA in Figure 1.14 on page 42 

5 http://en.wikipedia.org/wiki/Gray_coding [accessed 2007-07-03] 



$$G = \{\forall\, (g[1], g[2], \ldots, g[n]) : g[i] \in G_i \ \forall i \in 1..n\} \qquad (3.1)$$

$$G = \{\forall \text{ lists } g : g[i] \in G_T \ \forall\, 0 \le i < \text{len}(g)\} \qquad (3.2)$$

operations, the genotypes of the two parents will recombine. In asexual reproduction,
mutations are the only changes that occur. It is very common to apply both principles
in conjunction, i.e., to first recombine two elements from the search space and subsequently
make them subject to mutation.

In nature, life begins with a single cell which divides⁶ time and again until a mature
individual is formed⁷ after the genetic information has been reproduced. The emergence
of a phenotype from its genotypic representation is called embryogenesis in biology and
its counterparts in evolutionary search are the genotype-phenotype mapping and artificial
embryogeny which we will discuss in Section 3.8 on page 155.

Let us briefly recapitulate the structure of the elements g of the search space G. A gene
(see Definition 1.23 on page 43) is the basic informational unit in a genotype g. Depending
on the genome, a gene can be a bit, a real number, or any other structure. In biology, a gene
is a segment of nucleic acid that contains the information necessary to produce a functional
RNA product in a controlled manner. An allele (see Definition 1.24) is a value of a specific
gene in nature and in EAs alike. The locus (see Definition 1.25) is the position where a
specific gene can be found in a chromosome. Besides the functional genes and their alleles,
there are also parts of natural genomes which have no (obvious) function [2161, 819]. The
American biochemist Gilbert [806] coined the term intron⁸ for such parts. Similar structures
can also be observed in evolutionary algorithms with variable-length encodings.

Definition 3.2 (Intron). Parts of a genotype $g \in G$ that do not contribute to the
phenotype x = gpm(g) are referred to as introns.

Biological introns have often been thought of as junk DNA or "old code", i.e., parts
of the genome that were translated to proteins in the evolutionary past but are not used
anymore. Currently though, many researchers assume that introns may not be as useless
as initially assumed [467]. Instead, they seem to provide support for efficient splicing, for
instance. The role of introns in genetic algorithms is just as mysterious. They represent a
form of redundancy - which is known to have positive as well as negative effects, as outlined
in Section 1.4.5 on page 67 and Section 4.10.3.

Figure 3.2 combines Figure 1.15 on page 45 and Figure 1.13 and illustrates the relations
between the aforementioned entities in a bit string genome $G = \mathbb{B}^4$ of length 4, where two
bits encode one coordinate in a two-dimensional plane. Additional bits could be appended
to the genotypes if, for instance, a variable-length representation were used. These bits
would then occur as introns and would not influence the phenotype in the example.



6 http://en.wikipedia.org/wiki/Cell_division [accessed 2007-07-03]

7 As a matter of fact, cell division will continue until the individual dies. However, this is not
important here.

8 http://en.wikipedia.org/wiki/Intron [accessed 2007-07-05]

Figure 3.2: A four bit string genome G and a fictitious phenotype



3.4 Fixed-Length String Chromosomes 



Especially widespread in genetic algorithms are search spaces based on fixed-length chro- 
mosomes. The properties of their crossover and mutation operations are well known and an 
extensive body of research on them is available [821, 945]. 



3.4.1 Creation: Nullary Reproduction 

Creating a fixed-length string individual simply means creating a new tuple of the structure
defined by the genome and initializing it with random values. In reference to Equation 3.1 on
page 145, we could roughly describe this process with Equation 3.3.

\[ \mathrm{create}() = (g[1], g[2], \ldots, g[n]) : g[i] = G_i\left[\lfloor \mathrm{random}_u() \cdot \mathrm{len}(G_i) \rfloor\right] \;\forall i \in 1..n \tag{3.3} \]
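Equation 3.3 can be read as: for every locus i, pick one allele uniformly at random from the value set G_i of that gene. The following is a minimal Python sketch of this reading. It assumes that each G_i is given as a finite list of alleles; the names `create_fixed_length` and `gene_domains` are illustrative and do not come from the book.

```python
import math
import random


def create_fixed_length(gene_domains, rng=random):
    """Create a random fixed-length genotype.

    gene_domains[i] is assumed to be the finite list G_i of alleles
    allowed at locus i (an assumption made for this sketch).
    """
    genotype = []
    for domain in gene_domains:
        # floor(random_u() * len(G_i)) is a uniformly distributed index into G_i
        index = math.floor(rng.random() * len(domain))
        genotype.append(domain[index])
    return tuple(genotype)


# Example: three bits followed by one gene over the alleles "a", "b", "c".
domains = [[0, 1], [0, 1], [0, 1], ["a", "b", "c"]]
g = create_fixed_length(domains, random.Random(42))
```

Passing a seeded `random.Random` instance makes the creation reproducible, which is convenient when experiments have to be repeated.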



3.4.2 Mutation: Unary Reproduction 

Mutation is an important method for preserving the diversity of the solution candidates by
introducing small, random changes into them. In fixed-length string chromosomes, this can
be achieved by randomly modifying the value (allele) of a gene, as illustrated in Fig. 3.3.a.
Fig. 3.3.b shows the more general variant of this form of mutation where $0 < n < \mathrm{len}(g)$
locations in the genotype g are changed at once. In binary coded chromosomes, for example,
these genes would be bits which can simply be toggled. For real-encoded genomes, modifying
an element $g_i$ can be done by replacing it with a number drawn from a normal distribution
with expected value $g_i$, i.e., $g_i^{\mathrm{new}} \sim N(g_i, \sigma^2)$.
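Both variants can be sketched in a few lines of Python. The function names and the choice of mutating exactly one gene in the real-valued case are assumptions made for illustration, not a prescribed implementation.

```python
import random


def mutate_bits(genotype, n, rng=random):
    """Toggle n distinct, randomly chosen bits of a bit-string genotype."""
    child = list(genotype)
    for locus in rng.sample(range(len(child)), n):
        child[locus] = 1 - child[locus]
    return child


def mutate_real(genotype, sigma, rng=random):
    """Replace one randomly chosen gene g_i by a draw from N(g_i, sigma^2)."""
    child = list(genotype)
    locus = rng.randrange(len(child))
    child[locus] = rng.gauss(child[locus], sigma)
    return child
```

Both functions return a modified copy, so the parent stays available for selection. With n = 1, `mutate_bits` is the single-gene mutation of Fig. 3.3.a.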



Fig. 3.3.a: Single-gene mutation. Fig. 3.3.b: Multi-gene mutation (a). Fig. 3.3.c: Multi-gene mutation (b).

Figure 3.3: Value-altering mutation of string chromosomes.

3.4.3 Permutation: Unary Reproduction

The permutation operation is an alternative mutation method where the alleles of two genes
are exchanged as sketched in Figure 3.4. This, of course, only makes sense if all genes have
similar data types. Permutation is, for instance, useful when solving problems that involve
finding an optimal sequence of items, like the travelling salesman problem [1263, 78]. Here, a
genotype g could encode the sequence in which the cities are visited. Exchanging two alleles
then equals switching two cities in the route.
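A minimal sketch of this operator in Python could look as follows. The city labels are hypothetical and only serve as an example route.

```python
import random


def permute(genotype, rng=random):
    """Exchange the alleles of two distinct, randomly chosen genes."""
    child = list(genotype)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


route = ["A", "B", "C", "D", "E"]  # hypothetical city labels
new_route = permute(route, random.Random(7))
```

Since only positions are exchanged, the child is always a valid tour again if the parent was one, which is the main reason for preferring this operator over value-altering mutation in sequencing problems.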



3.4.4 Crossover: Binary Reproduction 

Amongst all evolutionary algorithms, genetic algorithms have the recombination operation 
which probably comes closest to the natural paragon. Figure 3.5 outlines the recombination 
of two string chromosomes, the so-called crossover, which is performed by swapping parts 
of two genotypes. 

When performing single-point crossover (SPX 9), both parental chromosomes are split
at a randomly determined crossover point. Subsequently, a new child genotype is created
by appending the second part of the second parent to the first part of the first parent as
illustrated in Fig. 3.5.a. In two-point crossover (TPX, sketched in Fig. 3.5.b), both parental
genotypes are split at two points and a new offspring is created by using parts number one
and three from the first, and the middle part from the second parent chromosome. Fig. 3.5.c
depicts the generalized form of this technique: the n-point crossover operation, also called
multi-point crossover (MPX). For fixed-length strings, the crossover points for both parents
are always identical.
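The three variants differ only in the number of crossover points, so one function can cover them. The sketch below is one possible Python formulation; drawing the cut points uniformly without repetition is an assumption of this example.

```python
import random


def multi_point_crossover(parent_a, parent_b, n, rng=random):
    """n-point crossover of two equally long strings (n=1: SPX, n=2: TPX)."""
    if len(parent_a) != len(parent_b):
        raise ValueError("fixed-length crossover needs parents of equal length")
    # Choose n distinct cut points strictly inside the string.
    cuts = sorted(rng.sample(range(1, len(parent_a)), n))
    sources = (parent_a, parent_b)
    child = []
    start = 0
    for k, cut in enumerate(cuts + [len(parent_a)]):
        # Segments are taken alternately from the first and the second parent.
        child.extend(sources[k % 2][start:cut])
        start = cut
    return child
```

The same cut points are applied to both parents, which keeps the length of the child equal to the length of the parents.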




Fig. 3.5.a: Single-point Crossover (SPX). Fig. 3.5.b: Two-point Crossover (TPX). Fig. 3.5.c: Multi-point Crossover (MPX).

Figure 3.5: Crossover (recombination) operators for fixed-length string genomes.

Figure 3.4: Permutation applied to a string chromosome.

9 This abbreviation is also used for simplex crossover, see Section 16.4.

3.5 Variable-Length String Chromosomes

Variable-length genomes for genetic algorithms were first proposed by Smith in his PhD
thesis [1912]. There, he introduced a new variant of classifier systems 10 with the goal of
evolving programs for playing poker [1912, 1688].

3.5.1 Creation: Nullary Reproduction 

Variable-length strings can be created by first randomly drawing a length $l > 0$ and then
creating a list of that length filled with random elements.

3.5.2 Mutation: Unary Reproduction 

If the string chromosomes are of variable length, the set of mutation operations introduced
in Section 3.4 can be extended by two additional methods. First, we could insert a couple
of genes with randomly chosen alleles at any given position into a chromosome (Fig. 3.6.a).
Second, this operation can be reversed by deleting elements from the string (Fig. 3.6.b).
It should be noted that both insertion and deletion are also implicitly performed by
crossover. Recombining two identical strings with each other can, for example, lead to deletion
of genes. The crossover of different strings may turn out as an insertion of new genes
into an individual.

Since the reproduction operations can change the length of a genotype (hence the
name "variable-length"), variable-length strings need to be constructed of elements of the
same type. There is no longer a constant relation between locus and type.
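The two additional mutation operators can be sketched as follows. Inserting and deleting a consecutive run of genes is an assumption of this example; the text only requires that genes are added or removed somewhere in the string.

```python
import random


def insert_genes(genotype, alleles, count, rng=random):
    """Insert `count` genes with random alleles at one random position."""
    position = rng.randrange(len(genotype) + 1)
    new_genes = [rng.choice(alleles) for _ in range(count)]
    return genotype[:position] + new_genes + genotype[position:]


def delete_genes(genotype, count, rng=random):
    """Delete `count` consecutive genes starting at a random position."""
    count = min(count, len(genotype))
    position = rng.randrange(len(genotype) - count + 1)
    return genotype[:position] + genotype[position + count:]
```

Both functions expect the genotype as a Python list whose elements all come from the same allele set, in line with the remark above that locus and type are no longer tied together.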




Fig. 3.6.a: Insertion of random genes. Fig. 3.6.b: Deletion of genes.

Figure 3.6: Search operators for variable-length strings (additional to those from Section 3.4.2 
and Section 3.4.3). 



3.5.3 Crossover: Binary Reproduction 

For variable-length string chromosomes, the same crossover operations are available as for 
fixed-length strings except that the strings are no longer necessarily split at the same loci. 
The lengths of the new strings resulting from such a cut and splice operation may differ 
from the lengths of the parents, as sketched in Figure 3.7. A special case of this type of 
recombination is the homologous crossover, where only genes at the same loci are exchanged. 
This method is discussed thoroughly in Section 4.6.7 on page 195. 
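For the single-point case, the cut and splice operation can be sketched like this. The function name and the uniform choice of the two cut points are assumptions of this example.

```python
import random


def cut_and_splice(parent_a, parent_b, rng=random):
    """Single-point crossover with an independent cut point in each parent."""
    cut_a = rng.randrange(len(parent_a) + 1)
    cut_b = rng.randrange(len(parent_b) + 1)
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    return child_1, child_2
```

The two children together contain exactly the genes of the two parents, but each child may be longer or shorter than either parent. Forcing `cut_a == cut_b` would turn this into the homologous variant mentioned above.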



10 See Chapter 7 for more information on classifier systems.







Fig. 3.7.a: Single-Point Crossover. Fig. 3.7.b: Two-Point Crossover. Fig. 3.7.c: Multi-Point Crossover.



Figure 3.7: Crossover of variable-length string chromosomes. 



3.6 Schema Theorem 



The Schema Theorem is a special instance of forma analysis (discussed in Section 1.5.1 
on page 80) for genetic algorithms. As a matter of fact, it is older than its generalization and
was first stated by Holland back in 1975 [940, 512, 945]. Here we will first introduce the 
basic concepts of schemata, masks, and wildcards before going into detail about the Schema 
Theorem itself, its criticism, and the related Building Block Hypothesis. 



3.6.1 Schemata and Masks 

Assume that the genotypes g in the search space G of genetic algorithms are strings of
a fixed length $l$ over an alphabet 11 $\Sigma$, i.e., $G = \Sigma^l$. Normally, $\Sigma$ is the binary alphabet
$\Sigma = \{\mathrm{true}, \mathrm{false}\} = \{0, 1\}$. From forma analysis, we know that properties can be defined
on the genotypic or the phenotypic space. For fixed-length string genomes, we can consider
the values at certain loci as properties of a genotype. There are two basic principles for
defining such properties: masks and don't care symbols.

Definition 3.3 (Mask). For a fixed-length string genome $G = \Sigma^l$, we define the set of all
genotypic masks $M_l$ as the power set 12 of the valid loci, $M_l = \mathcal{P}(\{1, \ldots, l\})$ [2167]. Every
mask $m_i \in M_l$ defines a property $\phi_i$ and an equivalence relation:

\[ g \sim_{\phi_i} h \Leftrightarrow g[j] = h[j] \;\forall j \in m_i \tag{3.4} \]

The order "order($m_i$)" of the mask $m_i$ is the number of loci defined by it:

\[ \mathrm{order}(m_i) = |m_i| \tag{3.5} \]

The defined length $\delta(m_i)$ of a mask $m_i$ is the maximum distance between two indices in
the mask:

\[ \delta(m_i) = \max\{|j - k| \;\forall j, k \in m_i\} \tag{3.6} \]

A mask contains the indices of all elements in a string that are interesting in terms of the
property it defines. Assume we have bit strings of the length $l = 3$ as genotypes ($G = \mathbb{B}^3$).
The set of valid masks $M_3$ is then $M_3 = \{\{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$. The
mask $m_1 = \{1, 2\}$, for example, specifies that the values at the loci 1 and 2 of a genotype
denote the value of a property $\phi_1$ and the value of the bit at position 3 is irrelevant. Therefore,
it defines four formae $A_{\phi_1=(0,0)} = \{(0,0,0), (0,0,1)\}$, $A_{\phi_1=(0,1)} = \{(0,1,0), (0,1,1)\}$,
$A_{\phi_1=(1,0)} = \{(1,0,0), (1,0,1)\}$, and $A_{\phi_1=(1,1)} = \{(1,1,0), (1,1,1)\}$.
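The definitions of order, defined length, and the formae induced by a mask can be checked with a small Python sketch. Masks are represented as sets of 1-based loci, as in the text; the function names are illustrative assumptions.

```python
from itertools import product


def order(mask):
    """Number of loci defined by the mask (Equation 3.5)."""
    return len(mask)


def defined_length(mask):
    """Maximum distance between two loci of a non-empty mask (Equation 3.6)."""
    return max(mask) - min(mask)


def formae(mask, length, alphabet=(0, 1)):
    """Group all genotypes of the given length by their values at the mask's loci."""
    groups = {}
    for genotype in product(alphabet, repeat=length):
        key = tuple(genotype[j - 1] for j in sorted(mask))  # loci are 1-based
        groups.setdefault(key, []).append(genotype)
    return groups


m1 = {1, 2}
classes = formae(m1, 3)
```

For the mask {1, 2} on three-bit strings, `classes` contains four groups of two genotypes each, one for every combination of values at the first two loci.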

Definition 3.4 (Schema). A forma defined on a string genome concerning the values of
the characters at specified loci is called a schema [940, 389].

11 Alphabets and the like are defined in Section 30.3 on page 561.

12 The power set is described in Definition 27.9 on page 458.






3.6.2 Wildcards 



The second method of specifying such schemata is to use don't care symbols (wildcards) 
to create "blueprints" H of their member individuals. Therefore, we place the don't care 
symbol * at all irrelevant positions and the characterizing values of the property at the 
others. 



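\[ \forall j \in 1..l \Rightarrow H[j] = \begin{cases} g[j] & \text{if } j \in m_i \\ * & \text{otherwise} \end{cases} \tag{3.7} \]

\[ H[j] \in \Sigma \cup \{*\} \;\forall j \in 1..l \tag{3.8} \]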



We can now redefine the aforementioned schemata as $A_{\phi_1=(0,0)} \equiv H_1 = (0, 0, *)$,
$A_{\phi_1=(0,1)} \equiv H_2 = (0, 1, *)$, $A_{\phi_1=(1,0)} \equiv H_3 = (1, 0, *)$, and $A_{\phi_1=(1,1)} \equiv H_4 = (1, 1, *)$. These
schemata mark hyperplanes in the search space G, as illustrated in Figure 3.8 for the three
bit genome. Schemas correspond to masks and thus, definitions like the defined length and
order can easily be transported into their context.
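Working with blueprints is straightforward in code. The sketch below represents both blueprints and genotypes as Python strings, which is an assumption made for brevity; order and defined length are computed directly from the wildcard notation.

```python
def matches(schema, genotype):
    """True if the genotype is an instance of the blueprint.

    '*' is the don't care symbol; every other position must agree exactly.
    """
    return len(schema) == len(genotype) and all(
        s == "*" or s == g for s, g in zip(schema, genotype)
    )


def schema_order(schema):
    """Number of fixed (non-wildcard) positions."""
    return sum(1 for s in schema if s != "*")


def schema_defined_length(schema):
    """Distance between the outermost fixed positions (0 if there are none)."""
    fixed = [j for j, s in enumerate(schema) if s != "*"]
    return fixed[-1] - fixed[0] if fixed else 0
```

For instance, the blueprint `"01*"` matches `"010"` and `"011"` but not `"110"`.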



Figure 3.8: An example for schemata in a three bit genome.



3.6.3 Holland's Schema Theorem 

The Schema Theorem 13 was defined by Holland [940] for genetic algorithms which use 
fitness-proportionate selection (see Section 2.4.3 on page 124) where fitness is subject to 
maximization [512, 945]. 



\[ \mathrm{countOccurences}(H, Pop)_{t+1} \ge \mathrm{countOccurences}(H, Pop)_t \cdot \frac{v(H)_t}{\overline{v}_t} \cdot (1 - p) \tag{3.10} \]

where

1. countOccurences(H, Pop)_t is the number of instances of a given schema defined by the
blueprint H in the population Pop of generation t,

2. $v(H)_t$ is the average fitness of the members of this schema (observed in time step t),

3. $\overline{v}_t$ is the average fitness of the population in time step t, and

4. p is the probability that an instance of the schema will be "destroyed" by a reproduction
operation, i.e., the probability that the offspring of an instance of the schema is not an
instance of the schema.

13 http://en.wikipedia.org/wiki/Holland%27s_Schema_Theorem [accessed 2007-07-29]

From this formula, it can be deduced that genetic algorithms will generate an exponentially
rising number of samples for short schemata of above-average fitness. This is because they will
multiply with a certain factor in each generation and only few of them are destroyed by the
reproduction operations. In the special case of single-point crossover (crossover rate cr) and
single-bit mutation (mutation rate mr) in a binary genome of the fixed length $l$ ($G = \mathbb{B}^l$),
the destruction probability p is noted in Equation 3.11.

\[ p = cr \, \frac{\delta(H)}{l - 1} + mr \, \frac{\mathrm{order}(H)}{l} \tag{3.11} \]
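To get a feeling for the magnitudes involved, Equations 3.10 and 3.11 can be evaluated numerically. The following sketch uses made-up numbers (a schema of order 2 and defined length 1 in a 10-bit genome, 20 instances, fitness ratio 1.5); none of these values come from the book.

```python
def destruction_probability(defined_length, order, length, cr, mr):
    """Equation 3.11: single-point crossover plus single-bit mutation."""
    return cr * defined_length / (length - 1) + mr * order / length


def schema_bound(count, schema_fitness, mean_fitness, p):
    """Right-hand side of Equation 3.10: lower bound on the number of
    schema instances in the next generation."""
    return count * (schema_fitness / mean_fitness) * (1.0 - p)


# Hypothetical example values, chosen only for illustration.
p = destruction_probability(defined_length=1, order=2, length=10, cr=0.9, mr=0.1)
bound = schema_bound(count=20, schema_fitness=1.5, mean_fitness=1.0, p=p)
```

With these numbers, p is 0.9/9 + 0.1 * 2/10 = 0.12 and the bound is 20 * 1.5 * 0.88 = 26.4 instances. Increasing the defined length to 9 raises p to 0.92 and drops the bound to 2.4, which illustrates why short schemata are favoured.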



3.6.4 Criticism of the Schema Theorem 



The deduction that good schemata will spread exponentially is only a very optimistic as- 
sumption and not generally true. If a highly fit schema has many offspring with good fit- 
ness, this will also improve the overall fitness of the population. Hence, the probabilities in 
Equation 3.10 will shift over time. Generally, the Schema Theorem represents a lower bound 
that will only hold for one generation [2208]. Trying to derive predictions for more than one 
or two generations using the Schema Theorem as is will lead to deceptive or wrong results 
[858, 854]. 

Furthermore, the population of a genetic algorithm only represents a sample of limited
size of the search space G. This limits the reproduction of the schemata but also makes
statements about probabilities in general more complicated. We only have samples
of the schemata H and cannot be sure whether $v(H)_t$ really represents the average fitness of all
the members of the schema (that is why we annotate it with t instead of writing $v(H)$).
Thus, even reproduction operators which preserve the instances of the schema may lead to
a decrease of $v(H)_{t+\ldots}$ over time. It is also possible that parts of the population already have
converged and other members of a schema will not be explored anymore, so we do not get
further information about its real utility.

Additionally, we cannot know if it is really good if one specific schema spreads fast, even
if it is very fit. Remember that we have already discussed the exploration versus exploitation
topic and the importance of diversity in Section 1.4.2 on page 60.

Another issue is that we implicitly assume that most schemata are compatible and can
be combined, i.e., that there is low interaction between different genes. This is also not
generally valid: Epistatic effects, for instance, can lead to schema incompatibilities. The
expressiveness of masks and blueprints is also limited, and it can be argued that there are
properties which we cannot specify with them. Take the set $D_3$ of numbers divisible by
three, for example: $D_3 = \{3, 6, 9, 12, \ldots\}$. Representing them as binary strings will lead to
$D_3 = \{0011, 0110, 1001, 1100, \ldots\}$ if we have a bit-string genome of the length 4. Obviously, we
cannot capture these genotypes in a schema using the discussed approach. They may, however,
be gathered in a forma. The Schema Theorem, however, cannot hold for such a forma since
the probability p of destruction may be different from instance to instance.
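The claim about the numbers divisible by three can be verified by brute force, since a genome of length 4 only admits 3^4 = 81 blueprints over {0, 1, *}. The sketch below enumerates all of them; treating only the positive multiples 3, 6, 9, 12, and 15 as members of the set is an assumption that follows the listing in the text.

```python
from itertools import product

LENGTH = 4

# All 4-bit genotypes whose value is a positive multiple of three.
target = {format(v, "04b") for v in range(1, 2 ** LENGTH) if v % 3 == 0}


def members(schema):
    """All bit strings that are instances of the blueprint."""
    options = ["01" if s == "*" else s for s in schema]
    return {"".join(bits) for bits in product(*options)}


# Enumerate all blueprints and keep those describing the target set exactly.
exact = [
    "".join(schema)
    for schema in product("01*", repeat=LENGTH)
    if members(schema) == target
]
```

The list `exact` stays empty. This is no surprise: a blueprint with k wildcards always has exactly 2^k instances, whereas the target set has five members.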



3.6.5 The Building Block Hypothesis 

According to Harik [896], the substructure of a genotype which allows it to match a schema
is called a building block. The Building Block Hypothesis (BBH) proposed by Goldberg
[821] and Holland [940] is based on two assumptions:

1. When a genetic algorithm solves a problem, there exist some low-order, low-defining
length schemata with above-average fitness (the so-called building blocks).

2. These schemata are combined step by step by the genetic algorithm in order to form 
larger and better strings. By using the building blocks instead of testing any possible bi- 
nary configuration, genetic algorithms efficiently decrease the complexity of the problem. 
[821] 

Although it seems as if the Building Block Hypothesis is supported by the Schema
Theorem, this cannot be verified easily. Experiments that were originally intended to prove
this theory often did not work out as planned [1432] (also consider the criticisms of the
Schema Theorem mentioned in the previous section). In general, there exists much criticism
of the Building Block Hypothesis and, although it is a very nice model, it cannot yet be
considered sufficiently proven.

3.7 The Messy Genetic Algorithm 

According to the Schema Theorem specified in Equation 3.10 and Equation 3.11, a schema
is likely to spread in the population if it has above-average fitness, is short (i.e., has a low
defined length), and is of low order [116]. Thus, according to Equation 3.11, of two schemas of the
same average fitness and order, the one with the lesser defined length will be propagated
to more offspring, since it is less likely to be destroyed by crossover. Therefore, placing
dependent genes close to each other would be a good search space design approach since it will
allow good building blocks to proliferate faster. These building blocks, however, are not
known at design time - otherwise the problem would already be solved. Hence, it is not
generally possible to devise such a design.

The messy genetic algorithms (mGAs) developed by Goldberg et al. [825] use a coding 
scheme which is intended to allow the genetic algorithm to re-arrange genes at runtime. 
It can place the genes of a building block spatially close together. This method of linkage 
learning may thus increase the probability that these building blocks, i.e., sets of epistatically 
linked genes, are preserved during crossover operations, as sketched in Figure 3.9. It thus 
mitigates the effects of epistasis as discussed in Section 1.4.6. 

[Before rearranging: the two linked genes are destroyed in 6 out of 9 cases by crossover.
After rearranging: they are destroyed in 1 out of 9 cases by crossover.]

Figure 3.9: Two linked genes and their destruction probability under single-point crossover.
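The counts in Figure 3.9 can be reproduced by enumerating the crossover points. The sketch assumes a string of ten genes (hence nine possible cut points) and picks the positions of the two linked genes so that they are six loci apart before and adjacent after the rearrangement; the exact positions are assumptions, since only the counts are given.

```python
def destroying_cuts(locus_a, locus_b, length):
    """Count single-point crossover cuts that separate two loci.

    A cut point c (1 <= c <= length - 1) splits the string between the
    0-based positions c - 1 and c.
    """
    low, high = sorted((locus_a, locus_b))
    separating = sum(1 for c in range(1, length) if low < c <= high)
    return separating, length - 1


far_apart = destroying_cuts(1, 7, 10)  # assumed positions, distance 6
adjacent = destroying_cuts(4, 5, 10)   # assumed positions after rearranging
```

The number of separating cuts equals the distance between the two loci, which is exactly the defined length of the corresponding schema in Equation 3.11.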



3.7.1 Representation 

The idea behind the genomes used in messy GAs goes back to the work of Bagley
[116] from 1967, who first introduced a representation where the ordering of the
genes was not fixed. Instead, for each gene a tuple $(\phi, \gamma)$ with its position (locus)
$\phi$ and value (allele) $\gamma$ was used. For instance, the bit string 000111 can be
represented as $g_1 = ((0,0), (1,0), (2,0), (3,1), (4,1), (5,1))$ but also as
$g_2 = ((5,1), (1,0), (3,1), (2,0), (0,0), (4,1))$, where both genotypes map to the same phenotype,
i.e., $\mathrm{gpm}(g_1) = \mathrm{gpm}(g_2)$.

3.7.2 Reproduction Operations

Inversion: Unary Reproduction 

The inversion operator reverses the order of genes between two randomly chosen loci 
[116, 896]. With this operation, any particular ordering can be produced in a relatively 
small number of steps. Figure 3.10 illustrates, for example, how the possible building block 
components (1, 0), (3, 0), (4, 0), and (6, 0) can be brought together in two steps. Nevertheless, 
the effects of the inversion operation were rather disappointing [116, 741].

Figure 3.10: An example for two subsequent applications of the inversion operation [896].



Cut: Unary Reproduction 

The cut operator splits a genotype g into two with the probability $p_c = (\mathrm{len}(g) - 1)\, p_\kappa$, where
$p_\kappa$ is a bitwise probability and len(g) the length of the genotype [1153]. With $p_\kappa = 0.1$, the
genotype $g_1 = ((0,0), (1,0), (2,0), (3,1), (4,1), (5,1))$ has a cut probability of $p_c = (6 - 1) \cdot 0.1 = 0.5$.
A cut at position 4 would lead to $g_3 = ((0,0), (1,0), (2,0), (3,1))$ and $g_4 = ((4,1), (5,1))$.
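The two unary operators of this section, inversion and cut, can be sketched on lists of (locus, allele) pairs. Choosing the inversion boundaries and the cut position uniformly at random is an assumption of this example; the text does not fix these details.

```python
import random


def invert(genotype, rng=random):
    """Reverse the order of the genes between two randomly chosen loci."""
    i, j = sorted(rng.sample(range(len(genotype) + 1), 2))
    return genotype[:i] + genotype[i:j][::-1] + genotype[j:]


def cut(genotype, p_kappa, rng=random):
    """Split a genotype in two with probability p_c = (len(g) - 1) * p_kappa."""
    p_c = (len(genotype) - 1) * p_kappa
    if len(genotype) > 1 and rng.random() < p_c:
        position = rng.randrange(1, len(genotype))
        return [genotype[:position], genotype[position:]]
    return [genotype]


g1 = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
```

Because every gene carries its own locus, inversion changes only the order of the pairs and never the phenotype. `cut` returns a list with either one or two genotypes so that the caller can treat both outcomes uniformly.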



3.7.3 Splice: Binary Reproduction 

The splice operator joins two genotypes with a predefined probability $p_s$ by simply attaching
one to the other [1153]. Splicing $g_2 = ((5,1), (1,0), (3,1), (2,0), (0,0), (4,1))$ and
$g_4 = ((4,1), (5,1))$, for instance, leads to $g_5 = ((5,1), (1,0), (3,1), (2,0), (0,0), (4,1), (4,1), (5,1))$.
In summary, the application of two cut operations and a subsequent splice operation to two genotypes
has roughly the same effect as a single-point crossover operator in variable-length string
chromosomes (see Section 3.5.3).
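The splice operator is even simpler than the cut. The sketch below reproduces the example from the text; returning a list of one or two genotypes is a design choice of this example, not something prescribed by the messy GA.

```python
import random


def splice(first, second, p_s, rng=random):
    """Join two genotypes with probability p_s by attaching the second to the first."""
    if rng.random() < p_s:
        return [first + second]
    return [first, second]


g2 = [(5, 1), (1, 0), (3, 1), (2, 0), (0, 0), (4, 1)]
g4 = [(4, 1), (5, 1)]
g5 = splice(g2, g4, p_s=1.0)[0]
```

With `p_s = 1.0` the two genotypes are always joined, so `g5` is the eight-gene genotype given above.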



3.7.4 Overspecification and Underspecification 

The genotypes in messy GAs have a variable length and the cut and splice operators can lead
to genotypes being over- or underspecified. If we assume a three bit genome, the genotype
$g_6 = ((2,0), (0,0), (2,1), (1,0))$ is overspecified since it contains two (in this example, different)
alleles for the third gene (at locus 2). $g_7 = ((2,0), (0,0))$, in turn, is underspecified since it
does not contain any value for the gene in the middle (at locus 1).
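Both cases have to be resolved when a genotype is decoded. A minimal sketch of such a genotype-phenotype mapping is given here, assuming the commonly used rules for messy GAs [1153]: the first allele found for a locus wins, and loci without any allele are filled from a template string. The function name `decode` is an illustrative choice.

```python
def decode(genotype, template):
    """Map a messy genotype of (locus, allele) pairs to a phenotype string.

    Overspecification: the first allele found for a locus wins.
    Underspecification: missing loci are filled from the template.
    """
    phenotype = list(template)
    seen = set()
    for locus, allele in genotype:  # genes are processed from left to right
        if locus not in seen:
            phenotype[locus] = str(allele)
            seen.add(locus)
    return "".join(phenotype)


g6 = [(2, 0), (0, 0), (2, 1), (1, 0)]  # overspecified at locus 2
g7 = [(2, 0), (0, 0)]                  # underspecified at locus 1
```

With the template `"000"`, both `g6` and `g7` decode to `"000"`; with the template `"111"`, `g7` decodes to `"010"` because only locus 1 is taken from the template.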

Dealing with overspecification is rather simple [1153, 608]: The genes are processed from
left to right during the genotype-phenotype mapping, and the first allele found for a specific
locus wins. In other words, $g_6$ from above codes for 000 and the second value for locus 2 is
discarded. The loci left open during the interpretation of underspecified genes are filled with
values from a template string [1153]. If this string was 000, $g_7$ would code for 000, too.

3.7.5 The Process

In a simple genetic algorithm, building blocks are identified and recombined simultaneously, 
which leads to a race between recombination and selection [896]. In the messy GA [825, 826], 
this race is avoided by separating the evolutionary process into two stages: 

1. In the primordial phase, building blocks are identified. In the original conception of 
the messy GA, all possible building blocks of a particular order k are generated. Via 
selection, the best ones are identified and spread in the population. 

2. These building blocks are recombined with the cut and splice operators in the subsequent 
juxtapositional phase. 

The original mGA needed a bootstrap phase in order to identify the order-k building blocks,
which required identifying the order-(k - 1) blocks first. This bootstrapping was done by
applying the primordial and juxtapositional phases for all orders from 1 to k - 1. This
process was later improved by using a probabilistically complete initialization algorithm [828]
instead.

As already stated a dozen times by now, genetic algorithms use string genomes to encode
the phenotypes x that represent the possible solutions. These phenotypes, however, do not
necessarily need to be one-dimensional strings as well. Instead, they can be construction plans,
circuit layouts, or trees 14. The process of translating genotypes into corresponding phenotypes
is called genotype-phenotype mapping and has been introduced in Definition 1.30 on
page 44.

Embryogenesis 15 is the natural process in which the embryo forms and develops and
to which the genotype-phenotype mapping in genetic algorithms and Genetic Programming
corresponds. Even most of the more sophisticated of these mappings are based on an implicit
one-to-one relation in terms of complexity. In the Grammar-guided Genetic Programming
approach Gads 16, for example, a single gene encodes (at most) the application of a single
grammatical rule, which in turn unfolds a single node in a tree.

Embryogeny in nature is much more complex. Among other things, the DNA
encodes the structural design information of the human brain. As pointed out by
Manos et al. [1358], there are only about 30 thousand active genes in the human genome
(2800 million amino acids) for over 100 trillion neural connections in our cerebrum. A huge
amount of information is hence decoded from "data" which is of a much lower magnitude.
This is possible because the same genes can be reused in order to repeatedly create the same
pattern. The layout of the light receptors in the eye, for example, is always the same - just
their wiring changes.

Definition 3.5 (Artificial Embryogeny). We subsume all methods of transforming 
a genotype into a phenotype of (much) higher complexity under the subject of artificial 
embryogeny [1358, 1957, 192] (also known as computational embryogeny [1221, 259]). 

Two different approaches are common in artificial embryogeny: constructing the phenotype
by using a grammar to translate the genotype and expanding it step by step until
a terminal state is reached, or simulating chemical processes. Both methods may also require
subsequent correction steps that ensure that the produced results are correct, which is
also common in normal genotype-phenotype mappings [2295]. An example for gene reuse is
the genotype-phenotype mapping performed in Grammatical Evolution, which is discussed
in Section 4.5.6 on page 182.

14 See for example Section 4.5.6 on page 181

15 http://en.wikipedia.org/wiki/Embryogenesis [accessed 2007-07-03]

16 See Section 4.5.5 on page 179 for more details.



4 



Genetic Programming 



4.1 Introduction 

The term Genetic Programming 1 (GP) [1196, 916] has two possible meanings. First, it is often 
used to subsume all evolutionary algorithms that have tree data structures as genotypes. 
Second, it can also be defined as the set of all evolutionary algorithms that breed programs 2 , 
algorithms, and similar constructs. In this chapter, we focus on the latter definition which 
still includes discussing tree-shaped genomes. 

The conventional, well-known input-processing-output model 3 from computer science
states that a running instance of a program uses its input information to compute and
return output data. In Genetic Programming, usually some inputs or situations and corresponding
output data samples are known or can be produced or simulated. The goal then
is to find a program that connects them or that exhibits some kind of desired behavior
according to the specified situations, as sketched in Figure 4.1.



4.1.1 History 

The history of Genetic Programming [63] goes back to the early days of computer science.
In 1957, Friedberg [750] left the first footprints in this area by using a learning algorithm
to stepwise improve a program. The program was represented as a sequence of instructions 4
for a theoretical computer called Herman [750, 751]. Friedberg did not use an evolutionary,
population-based approach for searching the programs. This may be because the idea of
evolutionary algorithms wasn't fully developed yet 5 and also because of the limited computational
capacity of the computers of that era.

1 http://en.wikipedia.org/wiki/Genetic_programming [accessed 2007-07-03]

2 We have extensively discussed the topic of algorithms and programs in Section 30.1.1 on page 547.

3 see Section 30.1.1 on page 549

4 Linear Genetic Programming is discussed in Section 4.6 on page 191.

[Figure: input and output samples are known; the program connecting them is to be found with genetic programming.]

Figure 4.1: Genetic Programming in the context of the IPO model.

Around the same time, Samuel applied machine learning to the game of checkers and, by
doing so, created the world's first self-learning program. In the future development section
of his 1959 paper [1795], he suggested that effort could be invested in allowing the (checkers)
program to learn scoring polynomials - an activity which would equal symbolic regression.
Yet, in his 1967 follow-up work [1797], he could not report any progress on this issue.

The evolutionary programming approach for evolving finite state machines by Fogel et al. 
[708], discussed in Chapter 6 on page 231, dates back to 1966. In order to build predictors, 
different forms of mutation (but no crossover) were used for creating offspring from successful 
individuals. 

Fourteen years later, the next generation of scientists began to look for ways to evolve 
programs. New results were reported by Smith [1912] in his PhD thesis in 1980. Forsyth 
[733] evolved trees denoting fully bracketed Boolean expressions for classification problems 
in 1981 [733, 735, 734]. 

The mid-1980s were a very productive period for the development of Genetic Programming.
Cramer [462] applied a genetic algorithm in order to evolve a program written in a
subset of the programming language PL in 1985. 6 This GA used a string of integers as genome
and employed a genotype-phenotype mapping that recursively transformed them into program
trees. At the same time, the undergraduate student Schmidhuber [1828] also used a
genetic algorithm to evolve programs at Siemens AG. He re-implemented his approach
in Prolog at the TU Munich in 1987 [562, 1828]. Hicklin [924] and Fujiki [754] implemented
reproduction operations for manipulating the if-then clauses of LISP programs consisting of
single COND-statements. With this approach, Fujiki and Dickinson [753] evolved strategies
for playing the iterated prisoner's dilemma game. Bickel and Bickel [206] evolved sets of
rules which were represented as trees using tree-based mutation and crossover operators.

Genetic Programming became fully accepted at the end of this productive decade mainly 
because of the work of Koza [1183, 1184]. He also studied many benchmark applications of 
Genetic Programming, such as learning of Boolean functions [1190, 1185], the Artificial Ant 
problem 7 [1188, 1187, 1196], and symbolic regression 8 [1190, 1196], a method for obtaining 
mathematical expressions that match given data samples. Koza formalized (and patented 
[1183, 1194]) the idea of employing genomes purely based on tree data structures rather than 
string chromosomes as used in genetic algorithms. In symbolic regression, such trees can, for 
instance, encode Lisp S-expressions 9 where a node stands for a mathematical operation and
its child nodes are the parameters of the operation. Leaf nodes then are terminal symbols
like numbers or variables. This form of Genetic Programming is called Standard Genetic
Programming or SGP, in short. With it, not only mathematical functions but also more
complex programs can be expressed.
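Such a tree genome can be modeled with nested tuples, in the spirit of S-expressions. The following sketch evaluates an expression tree for given variable values; the small function set and the tuple layout are assumptions made for this example and are not taken from Koza's implementation.

```python
import operator

# Function set of this sketch: symbol -> (arity, implementation).
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
}


def evaluate(tree, variables):
    """Evaluate an expression tree given as nested tuples.

    Inner nodes are (symbol, child, child, ...); leaves are numbers or
    variable names.
    """
    if isinstance(tree, tuple):
        symbol, *children = tree
        arity, function = FUNCTIONS[symbol]
        assert len(children) == arity
        return function(*(evaluate(child, variables) for child in children))
    if isinstance(tree, str):
        return variables[tree]
    return tree


# (+ (* x x) (- x 1)) encodes x*x + (x - 1)
candidate = ("+", ("*", "x", "x"), ("-", "x", 1))
```

In symbolic regression, a fitness function would call `evaluate` for every data sample and compare the result with the expected output; for x = 3, the candidate above yields 11.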

Generally, a tree can represent a rule set [1389, 1390], a mathematical expressions, a 
decision tree [1193], or even the blueprint of an electrical circuit [1082]. Trees are very 
close to the natural structure of algorithms and programs. The syntax of most of the high- 
level programming languages, for example, leads to a certain hierarchy of modules and 
alternatives. Not only does this form normally constitute a tree - compilers even use tree 
representations internally. When reading the source code of a program, they first split it into 
tokens 10 , parse 11 these tokens, and finally create an abstract syntax tree 12 (AST) [1065, 961]. 
The internal nodes of ASTs are labeled by operators and the leaf nodes contain the operands 

5 Compare with Section 3.1 on page 141. 

6 Cramer's approach is discussed in Section 4.4.1 on page 171. 

7 The Artificial Ant is discussed in Section 21.3.1 on page 354 in this book. 

8 More information on symbolic regression is presented in Section 23.1 on page 397 in this book. 

9 List S-expressions are discussed in Section 30.3.11 on page 571 

10 http://en.wikipedia.org/wiki/Lexical_analysis [accessed 2007-07-03] 

11 http://en.wikipedia.org/wiki/Parse_tree [accessed 2007-07-03] 

12 http://en.wikipedia.org/wiki/Abstract_syntax_tree [accessed 2007-07-03] 
Algorithm: 

    Pop = createPop(s) 

    Input: s the size of the population to be created 
    Data: i a counter variable 
    Output: Pop the new, random population 

    Pop ← () 
    i ← s 
    while i > 0 do 
        Pop ← appendList(Pop, create()) 
        i ← i - 1 
    return Pop 

Program (Schematic Java, High-Level Language): 

    List<IIndividual> createPop(int s) { 
      List<IIndividual> Xpop; 
      Xpop = new ArrayList<IIndividual>(s); 
      for (int i = s; i > 0; i--) { 
        Xpop.add(create()); 
      } 
      return Xpop; 
    } 

Abstract Syntax Tree Representation: [tree diagram; the recoverable node labels are appendList, Pop, create, and i] 

Figure 4.2: The AST representation of algorithms/programs. 



of these operators. In principle, we can illustrate almost every 13 program or algorithm as 
such an AST (see Figure 4.2). 
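
As a small illustration of this idea, the following sketch parses a one-line statement and lists the nodes of its syntax tree. It uses Python's standard `ast` module rather than the schematic Java of Figure 4.2, and the names `pop` and `create` are placeholders chosen here; nothing is executed, the statement is only parsed.

```python
import ast

# Parse a one-line program; `pop` and `create` are only names, nothing is executed.
tree = ast.parse("pop.append(create())")

call = tree.body[0].value          # the outer call node (an internal node)
target = call.func                 # the attribute access pop.append
argument = call.args[0]            # the inner call create()

def node_labels(node, depth=0):
    """Return (depth, node type) pairs for the whole syntax tree, root first."""
    labels = [(depth, type(node).__name__)]
    for child in ast.iter_child_nodes(node):
        labels.extend(node_labels(child, depth + 1))
    return labels

# Print the tree with one level of indentation per depth.
for level, label in node_labels(tree):
    print("  " * level + label)
```

The internal nodes are the calls and the attribute access, while the names at the bottom of the tree play the role of the operands.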

Tree-based Genetic Programming directly evolves individuals in this form. This form also 
provides a very intuitive representation for mathematical functions, for which Koza 
initially used it. Another interesting aspect of the tree genome is that it has no natural 
role model. While genetic algorithms match their direct biological metaphor particularly 
well, Genetic Programming introduces completely new characteristics and traits. Genetic 
Programming is one of the few techniques that are able to learn solutions of potentially 
unbounded complexity. It can be considered as more general than genetic algorithms, because 
it makes fewer assumptions about the structure of possible solutions. Furthermore, it often 
offers white-box solutions that are human-interpretable. Other optimization approaches, 
like artificial neural networks, generate black-box outputs, which are highly 
complicated if not impossible to fully grasp [1382]. 

13 Excluding such algorithms and programs that contain jumps (the infamous "goto") that would 
produce crossing lines in the flowchart (http://en.wikipedia.org/wiki/Flowchart [accessed 2007-07-03]). 

4.2 General Information 
4.2.1 Areas Of Application 

Some example areas of application of Genetic Programming are: 



Application                                             References 

Symbolic Regression and Function Synthesis              [1190, 1196, 87, 2270, 1699, 1196, 17, 528]; Section 23.1 
Grammar Induction                                       [1042, 1394, 465, 1174] 
Data Mining and Data Analysis                           [1186, 744, 1592, 1593, 242, 445, 1193, 2253, 332]; Section 22.1.2 
Electrical Engineering and Circuit Design               [1082, 1182, 1080, 1206, 1205, 1211, 1669, 506] 
Medicine                                                [2055, 270, 243, 956] 
Economics and Finance                                   [1191, 1513, 1674, 1577] 
Geometry and Physics                                    [1307, 2277] 
Cellular Automata and Finite State Machines             [58, 59, 508, 509] 
Automated Programming                                   [140, 1242, 1324, 1325, 1317, 1212] 
Robotics                                                [1201, 1202, 1204, 986, 1317, 57, 1576, 986, 1323] 
Networking and Communication                            [434, 504, 2180, 1257, 1887, ...]; Section 24.1 on page 413 and Section 23.2 on page 401 
Evolving Behaviors, e.g., for Agents or Game Players    [1187, 179, 180, 1688, 1686, 1687, 907, 909, 55, 54, 1933, 67, 1492, 984, 987, 985, 986, 1340, 1341, 1342, 2194, 1323] 
Pattern Recognition                                     [53, 56, 2015, 2014, 2016] 
Biochemistry                                            [1200, 1199] 
Machine Learning                                        [1203, 863] 


See also Section 4.4.3 on page 174, Section 4.5.6 on page 184, and Section 4.7.4 on page 201. 



4.2.2 Conferences, Workshops, etc. 



Some conferences and workshops on Genetic Programming are: 



EuroGP: European Conference on Genetic Programming 
http://www.evostar.org/ [accessed 2007-09-05] 
Co-located with EvoWorkshops and EvoCOP. 
History: 2009: Tübingen, see [2106] 
         2008: Naples, Italy, see [1579] 
         2007: Valencia, Spain, see [617] 
         2006: Budapest, Hungary, see [429] 
         2005: Lausanne, Switzerland, see [1116] 
         2004: Coimbra, Portugal, see [1115] 
         2003: Essex, UK, see [1786] 
         2002: Kinsale, Ireland, see [737] 
         2001: Lake Como, Italy, see [1423] 
         2000: Edinburgh, Scotland, UK, see [1666] 
         1999: Göteborg, Sweden, see [1664] 
         1998: Paris, France, see [141, 1663] 

GECCO: Genetic and Evolutionary Computation Conference 
see Section 2.2.2 on page 107 

GP: Annual Genetic Programming Conference 
Now part of GECCO, see Section 2.2.2 on page 107 
History: 1998: Madison, Wisconsin, USA, see [1209, 1198] 
         1997: Stanford University, CA, USA, see [1208, 1956] 
         1996: Stanford University, CA, USA, see [1207, 1197] 

GPTP: Genetic Programming Theory and Practice Workshop 
http://www.cscs.umich.edu/gptp-workshops/ [accessed 2007-09-28] 
History: 2007: Ann Arbor, Michigan, USA, see [1945] 
         2006: Ann Arbor, Michigan, USA, see [1735] 
         2005: Ann Arbor, Michigan, USA, see [2298] 
         2004: Ann Arbor, Michigan, USA, see [1583] 
         2003: Ann Arbor, Michigan, USA, see [1734] 

ICANNGA: International Conference on Adaptive and Natural Computing Algorithms 
see Section 2.2.2 on page 108 

Mendel: International Conference on Soft Computing 
see Section 1.6.2 on page 90 



4.2.3 Journals 

Some journals that deal (at least partially) with Genetic Programming are: 

Genetic Programming and Evolvable Machines (GPEM), ISSN: 1389-2576 (Print) 1573-7632 
(Online), appears quarterly, editor(s): Wolfgang Banzhaf, publisher: Springer Netherlands, 
http://springerlink.metapress.com/content/104755/ [accessed 2007-09-28] 



4.2.4 Online Resources 

Some general resources on Genetic Programming that are available online are: 

http://www.genetic-programming.org/ [accessed 2007-09-20] and 
http://www.genetic-programming.com/ [accessed 2007-09-20] 
Last update: up-to-date 
Description: Two portal pages on Genetic Programming websites, both maintained by Koza 

http://www.cs.bham.ac.uk/~wbl/biblio/ [accessed 2007-09-16] 
Last update: up-to-date 
Description: Langdon's large Genetic Programming bibliography. 

http://www.lulu.com/items/volume_63/2167000/2167025/2/print/book.pdf [accessed 2008-03-26] 
Last update: up-to-date 
Description: A Field Guide to Genetic Programming, see [1667] 

http://www.aaai.org/AITopics/html/genalg.html [accessed 2008-05-17] 
Last update: up-to-date 
Description: The genetic algorithms and Genetic Programming pages of the AAAI 

http://www.cs.ucl.ac.uk/staff/W.Langdon/www_links.html [accessed 2008-05-is] 
Last update: 2007-07-28 
Description: William Langdon's Genetic Programming contacts 



4.2.5 Books 

Some books about (or including significant information about) Genetic Programming are: 

Koza [1196]: Genetic Programming, On the Programming of Computers by Means of Natural 
Selection 

Poli, Langdon, and McPhee [1667]: A Field Guide to Genetic Programming 

Koza [1195]: Genetic Programming II: Automatic Discovery of Reusable Programs 

Koza, Bennett III, Andre, and Keane [1210]: Genetic Programming III: Darwinian Invention 
and Problem Solving 

Koza, Keane, Streeter, Mydlowec, Yu, and Lanza [1212]: Genetic Programming IV: Routine 
Human-Competitive Machine Intelligence 

Langdon and Poli [1242]: Foundations of Genetic Programming 

Langdon [1238]: Genetic Programming and Data Structures: Genetic Programming + Data 
Structures = Automatic Programming! 

Banzhaf, Nordin, Keller, and Francone [140]: Genetic Programming: An Introduction - On 
the Automatic Evolution of Computer Programs and Its Applications 

Kinnear, Jr. [1140]: Advances in Genetic Programming, Volume 1 

Angeline and Kinnear, Jr. [61]: Advances in Genetic Programming, Volume 2 

Spector, Langdon, O'Reilly, and Angeline [1936]: Advances in Genetic Programming, Volume 3 

Brameier and Banzhaf [275]: Linear Genetic Programming 

Wong and Leung [2253]: Data Mining Using Grammar Based Genetic Programming and 
Applications 

Geyer-Schulz [795]: Fuzzy Rule-Based Expert Systems and Genetic Machine Learning 

Spector [1932]: Automatic Quantum Computer Programming - A Genetic Programming 
Approach 

Nedjah, Abraham, and de Macedo Mourelle [1511]: Genetic Systems Programming: Theory 
and Experiences 



4.3 (Standard) Tree Genomes 

Tree-based Genetic Programming (TGP), usually referred to as Standard Genetic Programming 
(SGP), is the most widespread Genetic Programming variant, both for historical reasons 
and because of its efficiency in many problem domains. In this section, the well-known 
reproduction operations applicable to tree genomes are outlined. 

4.3.1 Creation: Nullary Reproduction 

Before the evolutionary process can begin, we need an initial, randomized population. In 
genetic algorithms, we therefore simply created a set of random bit strings. For Genetic 
Programming, we do the same with trees instead of such one-dimensional sequences. 
Normally, there is a maximum depth d specified that the tree individuals are not allowed 
to surpass. Then, the creation operation will return only trees where the path between the 
root and the most distant leaf node is not longer than d. There are three different ways of 
realizing the "create()" operation (see Definition 2.9 on page 137) for trees, which can be 
distinguished according to the depth of the produced individuals. 

Figure 4.3: Tree creation by the full method. 

Figure 4.4: Tree creation by the grow method. 



The full method (Figure 4.3) creates trees where each (non-backtracking) path from the 
root to the leaf nodes has exactly the length d. The grow method, depicted in Figure 4.4, 
creates trees where each (non-backtracking) path from the root to the leaf nodes is not 
longer than d but may be shorter. This is achieved by deciding randomly for each node 
whether it should be a leaf or not when it is attached to the tree. Of course, only leaf 
nodes can be attached to nodes of depth d - 1. 

Koza [1196] additionally introduced a mixture method called ramped half-and-half. For 
each tree to be created, this algorithm draws a number r uniformly distributed between 2 
and d: $r = \lfloor \mathrm{random}_u(2, d + 1) \rfloor$. Now either full or grow is chosen 
to finally create a tree with the maximum depth r (in place of d). This method is often 
preferred since it produces an especially wide range of different tree depths and shapes 
and thus provides a great initial diversity. 
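
The three creation methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the book: trees are nested Python tuples `(op, child, ...)`, the function and terminal sets are made up, depth is counted in edges from the root, and grow turns a node into a leaf with an arbitrarily chosen probability of 0.5.

```python
import random

# Hypothetical function set (symbol -> arity) and terminal set.
FUNCTIONS = {"+": 2, "-": 2, "*": 2}
TERMINALS = ["x", "1", "2"]

def create(max_depth, method, rng, depth=0):
    """Create a random tree as nested tuples (op, child, ...); leaves are strings."""
    # At the depth limit only leaves may be attached; 'grow' may also stop earlier.
    if depth >= max_depth or (method == "grow" and rng.random() < 0.5):
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return (op,) + tuple(create(max_depth, method, rng, depth + 1)
                         for _ in range(FUNCTIONS[op]))

def ramped_half_and_half(d, rng):
    """Draw r uniformly from {2, ..., d}, then apply full or grow with limit r."""
    r = rng.randint(2, d)
    return create(r, rng.choice(["full", "grow"]), rng)

def leaf_depths(tree, depth=0):
    """Return the depth (edges from the root) of every leaf."""
    if not isinstance(tree, tuple):
        return [depth]
    return [d for child in tree[1:] for d in leaf_depths(child, depth + 1)]

rng = random.Random(42)
print(create(2, "full", rng))
print(ramped_half_and_half(4, rng))
```

With the full method every leaf ends up at depth d, whereas grow and ramped half-and-half only guarantee that no leaf lies deeper than the limit.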

4.3.2 Mutation: Unary Reproduction 

Tree genotypes may undergo small variations during the reproduction process in the evo- 
lutionary algorithm. Such a mutation is usually defined as the random selection of a node 
in the tree, removing this node and all of its children, and finally replacing it with another 
node [1196]. From this idea, three operators can be derived: 

1. the replacement of existing nodes by randomly created ones (Fig. 4.5.a), 

2. the insertion of new nodes or small trees (Fig. 4.5.b), and 

3. the deletion of nodes, as illustrated in Fig. 4.5.c. 

The effects of insertion and deletion can also be achieved with replacement. 
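
A minimal sketch of sub-tree replacement, assuming trees stored as nested tuples `(op, child, ...)` and nodes addressed by a path of child indices (both are representation choices made here for illustration). It also shows how insertion and deletion fall out of replacement.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of the tree with the node at `path` replaced by `new_subtree`."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def mutate(tree, new_subtree, rng):
    """Sub-tree replacement: pick one node uniformly at random and replace it."""
    path = rng.choice(list(paths(tree)))
    return replace_at(tree, path, new_subtree)

tree = ("+", ("*", "x", "2"), "1")               # (x * 2) + 1
print(replace_at(tree, (1, 2), "y"))             # replacement of a leaf
print(replace_at(tree, (2,), ("-", "1", "x")))   # insertion of a small tree
print(replace_at(tree, (1,), "x"))               # deletion: a node replaced by its child
print(mutate(tree, "y", random.Random(7)))
```

Picking the path uniformly from the list of all paths gives every node the same chance of being mutated.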
Fig. 4.5.a: Sub-tree replacement. Fig. 4.5.b: Sub-tree insertions. Fig. 4.5.c: Sub-tree deletion. 

Figure 4.5: Possible tree mutation operations. 



4.3.3 Recombination: Binary Reproduction 

The mating process in nature - the recombination of the genotypes of two individuals - 
is also copied in tree-based Genetic Programming. Applying the default sub-tree exchange 
recombination operator to two trees means swapping sub-trees between them, as illustrated 
in Figure 4.6. To do so, one single sub-tree is selected randomly from each of the parents. 
These sub-trees are then cut out and reinserted in the partner genotype. Notice that, like in 
genetic algorithms, the effects of insertion and deletion operations can also be achieved by 
recombination. 

Figure 4.6: Tree crossover by exchanging sub-trees. 



If a depth restriction is imposed on the genome, both the mutation and the crossover 
operations have to respect it: the new trees they create must not exceed the maximum depth. 
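
Sub-tree exchange under a depth restriction could look like the following sketch. The nested-tuple representation, the retry loop, and the fallback of returning the unchanged parents after a fixed number of failed attempts are assumptions made for this illustration; other ways of enforcing the limit are possible.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def subtree_at(tree, path):
    """Return the sub-tree rooted at the node addressed by `path`."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, new):
    """Return a copy of the tree with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def depth(tree):
    """Depth in edges: a single leaf has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

def crossover(a, b, max_depth, rng, tries=20):
    """Swap one random sub-tree of a with one of b, respecting max_depth."""
    for _ in range(tries):
        pa, pb = rng.choice(list(paths(a))), rng.choice(list(paths(b)))
        child_a = replace_at(a, pa, subtree_at(b, pb))
        child_b = replace_at(b, pb, subtree_at(a, pa))
        if max(depth(child_a), depth(child_b)) <= max_depth:
            return child_a, child_b
    return a, b  # assumption: give up and return the unchanged parents

mother = ("+", ("*", "x", "2"), "1")
father = ("-", "y", ("+", "y", ("*", "3", "y")))
print(crossover(mother, father, 3, random.Random(1)))
```

Since the two sub-trees only trade places, the total number of nodes in both offspring equals that of both parents.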

The intent of using the recombination operation in Genetic Programming is the same 
as in genetic algorithms. Over many generations, successful building blocks - for example a 
highly fit expression in a mathematical formula - should spread throughout the population 
and be combined with good genes of different solution candidates. Yet, recombination in 
Standard Genetic Programming can also have a very destructive effect on the individual 
fitness [1525, 1544, 140]. Angeline [62] even argues that it performs no better than mutation 
and causes bloat [65]. 
Several techniques have been proposed in order to mitigate these effects. In 1994, 
D'Haeseleer [557] obtained modest improvements with his strong context preserving 
crossover that permitted only the exchange of sub-trees that occupied the same positions 
in the parents. Poli and Langdon [1661, 1662] define the similar single-point crossover for 
tree genomes with the same purpose: increasing the probability of exchanging genetic 
material which is structurally and functionally akin and thus decreasing the disruptiveness. 
A related approach defined by Francone et al. [740] for linear Genetic Programming is 
discussed in Section 4.6.7 on page 195. 

4.3.4 Permutation: Unary Reproduction 

The tree permutation operation illustrated in Figure 4.7 resembles the permutation operation 
of string genomes or the inversion used in messy GA (Section 3.7.2, [1196]). Like mutation, 
it is used to reproduce one single tree. It first selects an internal node of the parental tree. 
The child nodes attached to that node are then shuffled randomly, i.e., permuted. If 
the tree represents a mathematical formula and the operation represented by the node is 
commutative, this has no direct effect. The main goal is to re-arrange the nodes in highly 
fit sub-trees in order to make them less fragile under other operations such as recombination. 
The effects of this operation are doubtful and most often it is not applied [1196]. 



4.3.5 Editing: Unary Reproduction 

Editing trees in Genetic Programming is what simplifying is to mathematical formulas. Take 
x = b + (7 - 4) + (1 * a), for instance. This expression clearly can be written in a shorter way 
by replacing (7 - 4) with 3 and (1 * a) with a. By doing so, we improve its readability and also 
decrease the computational time for concrete values of a and b. Similar measures can often 
be applied to algorithms and program code. Editing a tree as outlined in Figure 4.8 means 
to create a new offspring tree which is more efficient but, in terms of functional aspects, 
equivalent to its parent. It is thus a very domain-specific operation. 
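
For arithmetic expressions, editing can be sketched as bottom-up constant folding plus a rule for the neutral element of multiplication. The rule set below is deliberately tiny and is an assumption for illustration only; a real editing operation would carry many more domain-specific rules. Trees are nested tuples `(op, left, right)` with numbers and variable names as leaves.

```python
def edit(tree):
    """Return a simplified tree that computes the same function."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree[0], edit(tree[1]), edit(tree[2])
    # Constant folding: both operands are numbers, so compute the result now.
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return {"+": left + right, "-": left - right, "*": left * right}[op]
    # Multiplication by 1 can be dropped.
    if op == "*" and left == 1:
        return right
    if op == "*" and right == 1:
        return left
    return (op, left, right)

def evaluate(tree, env):
    """Evaluate a tree for the variable values given in env."""
    if isinstance(tree, str):
        return env[tree]
    if not isinstance(tree, tuple):
        return tree
    op, l, r = tree[0], evaluate(tree[1], env), evaluate(tree[2], env)
    return {"+": l + r, "-": l - r, "*": l * r}[op]

# b + (7 - 4) + (1 * a)
expr = ("+", ("+", "b", ("-", 7, 4)), ("*", 1, "a"))
print(edit(expr))
```

The edited tree has five nodes instead of nine and evaluates to the same value for all a and b.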




Figure 4.7: Tree permutation - (asexually) shuffling sub-trees. 

Figure 4.8: Tree editing - (asexual) optimization. 



A positive aspect of editing is that it usually reduces the number of nodes in a tree, 
for instance by removing useless expressions. This makes it easier for recombination 
operations to pick "important" building blocks. At the same time, the expression (7 - 4) 
is now less likely to be destroyed by the reproduction processes since it is replaced by the 
single terminal node 3. 

On the other hand, editing also reduces the diversity in the genome, which could degrade 
the performance by decreasing the variety of structures available. Another negative aspect 
would arise if (in our example) a fitter expression was (7 - (4 * a)) and a is a variable close 
to 1. Then, transforming (7 - 4) into 3 prevents a transition to the fitter expression. 

In Koza's experiments, Genetic Programming with and without editing showed equal 
performance, so this operation is not necessarily needed [1196]. 

4.3.6 Encapsulation: Unary Reproduction 

The idea behind the encapsulation operation is to identify potentially useful sub-trees and 
to turn them into atomic building blocks as sketched in Figure 4.9. To put it plainly, we create 
new terminal symbols that (internally hidden) are trees with multiple nodes. This way, 
they will no longer be subject to potential damage by other reproduction operations. The 
new terminal may spread throughout the population in the further course of the evolution. 
According to Koza, this operation has no substantial effect but may be useful in special 
applications like the evolution of artificial neural networks [1196]. 




Figure 4.9: An example for tree encapsulation. 



4.3.7 Wrapping: Unary Reproduction 

Applying the wrapping operation means to first select an arbitrary node n in the tree. 
Additionally, we create a new non-terminal node m outside of the tree. In m, at least one 
child node position is left unoccupied. We then cut n (and all its potential child nodes) from 
the original tree and append it to m by plugging it into the free spot. Now we hang m into 
the tree position that formerly was occupied by n. 




Figure 4.10: An example for tree wrapping. 
The purpose of this reproduction method, illustrated in Figure 4.10, is to allow 
modifications of non-terminal nodes that have a high probability of being useful. Simple 
mutation would, for example, cut n from the tree or replace it with another expression. 
This will always change the meaning of the whole sub-tree below n dramatically, like for 
example in (6+3) + a → (6*3) + a. By wrapping, however, a more subtle change like 
(6+3) + a → ((6+1)+3) + a is possible. 

The wrapping operation is introduced by the author - at least, I have not seen another 
source where it is used. 
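
The wrapping step can be written down in a few lines. In this sketch, trees are nested tuples `(op, child, ...)`, nodes are addressed by a path of child indices, and the new node m is binary with n placed in its first child position; all of these are assumptions made for illustration.

```python
def subtree_at(tree, path):
    """Return the sub-tree rooted at the node addressed by `path`."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, new):
    """Return a copy of the tree with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def wrap(tree, path, op, other_child):
    """Cut node n at `path`, plug it into a new node m = (op, n, other_child),
    and hang m into the position formerly occupied by n."""
    n = subtree_at(tree, path)
    m = (op, n, other_child)
    return replace_at(tree, path, m)

tree = ("+", ("+", 6, 3), "a")          # (6+3) + a
print(wrap(tree, (1, 1), "+", 1))       # wraps the leaf 6 into (6+1)
```

Wrapping the leaf 6 reproduces the example from the text: the sub-tree (6+3) becomes ((6+1)+3) while the rest of the tree is untouched.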

4.3.8 Lifting: Unary Reproduction 

While wrapping allows nodes to be inserted in non-terminal positions with only a small change 
to the tree's semantics, lifting is able to remove them in the same way. It is the inverse 
operation to wrapping, which becomes obvious when comparing Figure 4.10 and Figure 4.11. 

Figure 4.11: An example for tree lifting. 



Lifting begins with selecting an arbitrary inner node n of the tree. This node then replaces 
its parent node. The parent node, including all of its child nodes (except n), is removed 
from the tree. With lifting, a tree that represents the mathematical formula (6 + (1 - a)) * 3 
can be transformed to 6 * 3 in a single step. Lifting is used by the author in his experiments 
with Genetic Programming (see for example Section 24.1.2 on page 414). I, however, have 
not yet found other sources using a similar operation. 
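
Lifting is equally short when trees are stored as nested tuples `(op, child, ...)` and nodes are addressed by a path of child indices (representation choices assumed here for illustration). The node at the given path replaces its parent, whose path is obtained by dropping the last index.

```python
def subtree_at(tree, path):
    """Return the sub-tree rooted at the node addressed by `path`."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, new):
    """Return a copy of the tree with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def lift(tree, path):
    """Let the node at `path` replace its parent; the parent and the parent's
    other children are dropped. The root has no parent and cannot be lifted."""
    if not path:
        raise ValueError("the root node cannot be lifted")
    return replace_at(tree, path[:-1], subtree_at(tree, path))

tree = ("*", ("+", 6, ("-", 1, "a")), 3)     # (6 + (1 - a)) * 3
print(lift(tree, (1, 1)))                    # the 6 replaces the whole sum

wrapped = ("+", ("+", ("+", 6, 1), 3), "a")  # ((6+1)+3) + a
print(lift(wrapped, (1, 1, 1)))              # undoes the wrapping of the 6
```

The second call shows the inverse relation to wrapping: lifting the 6 out of (6+1) restores (6+3) + a.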

4.3.9 Automatically Defined Functions 

The concept of automatically defined functions (ADFs) introduced by Koza [1196] provides 
some sort of pre-specified modularity for Genetic Programming. Finding a way to evolve 
modules and reusable building blocks is one of the key issues in using GP to derive higher- 
level abstractions and solutions to more complex problems [66, 67, 1195]. If ADFs are used, 
a certain structure is defined for the genome. The root of the tree usually loses its functional 
responsibility and now serves only as glue that holds the individual together. It has a fixed 
number n of children, of which n - 1 are automatically defined functions and one is the 
result-generating branch. When evaluating the fitness of an individual, often only this 
result-generating branch is taken into consideration whereas the root and the ADFs are 
ignored. The result-generating branch, however, may use any of the automatically defined 
functions to produce its output. 

When ADFs are employed, typically not only their number must be specified beforehand 
but also the number of arguments of each of them. How this works can perhaps best be 
illustrated by using the example given in Figure 4.12. It stems from function approximation 14 , 
since this is the area where many early examples of the idea of ADFs come from. 

Assume that the goal of GP is to approximate a function g with the one parameter x 
and that a genome is used where two functions ($f_0$ and $f_1$) are automatically defined: $f_0$ 



14 A very common example for function approximation, Genetic Programming-based symbolic 
regression, is discussed in Section 23.1 on page 397. 

Figure 4.12: A concrete example for automatically defined functions. 



has a single formal parameter a and $f_1$ has two formal parameters a and b. The genotype 
in Figure 4.12 encodes the following mathematical functions: 

$$g(x) = f_1(4, f_0(x)) * (f_0(x) + 3)$$ 
$$f_0(a) = a + 7$$ 
$$f_1(a, b) = (-a) * b$$ 

Hence, $g(x) = ((-4) * (x + 7)) * ((x + 7) + 3)$. The number of children of the function 
calls in the result-generating branch must be equal to the number of the parameters of the 
corresponding ADF. 
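
The example can be checked with a small evaluator. The sketch below is an illustration under assumptions: trees are nested tuples, the unary negation is given the made-up name `neg`, and the two ADFs and the result-generating branch are stored separately instead of under a common root node.

```python
import operator

# Primitive operations available to all branches ("neg" is a name chosen here).
OPS = {"+": operator.add, "*": operator.mul, "neg": operator.neg}

# Each ADF: (formal parameters, body). f0(a) = a + 7, f1(a, b) = (-a) * b.
ADFS = {
    "f0": (("a",), ("+", "a", 7)),
    "f1": (("a", "b"), ("*", ("neg", "a"), "b")),
}

# Result-generating branch: f1(4, f0(x)) * (f0(x) + 3).
RESULT_BRANCH = ("*", ("f1", 4, ("f0", "x")), ("+", ("f0", "x"), 3))

def evaluate(tree, env):
    """Evaluate a tree; env maps variable or parameter names to values."""
    if isinstance(tree, str):
        return env[tree]
    if not isinstance(tree, tuple):
        return tree
    name, args = tree[0], [evaluate(child, env) for child in tree[1:]]
    if name in ADFS:
        params, body = ADFS[name]
        # The call must supply exactly as many children as the ADF has parameters.
        if len(params) != len(args):
            raise ValueError("wrong number of arguments for " + name)
        return evaluate(body, dict(zip(params, args)))
    return OPS[name](*args)

def g(x):
    return evaluate(RESULT_BRANCH, {"x": x})

print(g(0), g(1))
```

Because the arguments are evaluated before the ADF body is entered, each ADF body only sees its own formal parameters, which is exactly the call-by-value behavior described for ADFs.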

Although ADFs were first introduced in symbolic regression by Koza [1196], they can 
also be applied to a variety of other problems like in the evolution of agent behaviors [1688, 
1686, 52, 55], electrical circuit design [1206], or the evolution of robotic behavior [57]. 

4.3.10 Automatically Defined Macros 

Spector's idea of automatically defined macros (ADMs) complements the ADFs of Koza 
[1928, 1929]. Both concepts are very similar and only differ in the way that their parameters 
are handled. The parameters in automatically defined functions are always values whereas 
automatically defined macros work on code expressions. This difference shows up only when 
side-effects come into play. 

In Figure 4.13, we have illustrated the pseudo-code of two programs - one with a function 
(called ADF) and one with a macro (called ADM). Each program has a variable x which is 
initially zero. The function y() has the side-effect that it increments x and returns its new 
value. Both the function and the macro return a sum containing their parameter a two 
times. The parameter of ADF is evaluated before ADF is invoked. Hence, x is incremented one 
time and 1 is passed to ADF which then returns 2=1+1. The parameter of the macro, however, 
is the invocation of y(), not its result. Therefore, the ADM resembles two calls to y(), 
resulting in x being incremented two times and in 3=1+2 being returned. 
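
The same difference can be reproduced in a few lines of Python. Python has no macros, so this sketch imitates the macro by passing the unevaluated expression as a callable; that substitution, and the use of a dictionary to hold x, are assumptions made for illustration.

```python
state = {"x": 0}          # the shared variable x

def y():
    """Side effect: increment x, then return its new value."""
    state["x"] += 1
    return state["x"]

def adf(a):
    """Function: the parameter a is an already computed value."""
    return a + a

def adm(a):
    """Macro-like: a is an unevaluated expression (a callable), expanded twice."""
    return a() + a()

state["x"] = 0
adf_result = adf(y())     # y() runs once before the call
x_after_adf = state["x"]

state["x"] = 0
adm_result = adm(y)       # y is invoked twice inside the expansion
x_after_adm = state["x"]

print("ADF:", adf_result, "x =", x_after_adf)
print("ADM:", adm_result, "x =", x_after_adm)
```

Without the side effect in y(), both variants would return the same value, which is why the distinction only matters for side-effect-producing or context-sensitive operators.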

The ideas of automatically defined macros and automatically defined functions are very 
close to each other. Automatically defined macros are likely to be useful in scenarios where 
context-sensitive or side-effect-producing operators play important roles [1928, 1929]. In 
other scenarios, there is not much difference between the application of ADFs and ADMs. 
Finally, it should be mentioned that the concepts of automatically defined functions and 
Program with ADF: 

    variable x=0 

    subroutine y() 
    begin 
      x++ 
      return x 
    end 

    function ADF(param a) = (a+a) 

    main_program 
    begin 
      print("out: " + ADF(y)) 
    end 

  ...roughly resembles 

    main_program 
    begin 
      variable temp1, temp2 
      temp1 = y() 
      temp2 = temp1 + temp1 
      print("out: " + temp2) 
    end 

    exec main_program -> "out: 2" 

Program with ADM: 

    variable x=0 

    subroutine y() 
    begin 
      x++ 
      return x 
    end 

    macro ADM(param a) = (a+a) 

    main_program 
    begin 
      print("out: " + ADM(y)) 
    end 

  ...roughly resembles 

    main_program 
    begin 
      variable temp 
      temp = y() + y() 
      print("out: " + temp) 
    end 

    exec main_program -> "out: 3" 

Figure 4.13: Comparison of functions and macros. 



macros are not restricted to the standard tree genomes but are also applicable in other 
forms of Genetic Programming, such as linear Genetic Programming or PADO. 15 

4.3.11 Node Selection 

In most of the reproduction operations for tree genomes, in mutation as well as in recom- 
bination, certain nodes in the trees need to be selected. In order to apply the mutation, we 
first need to find the node which is to be altered. For recombination, we need one node in 
each parent tree. These nodes are then exchanged. The question how to select these nodes 
seems to be more or less irrelevant but plays an important role in reality. The literature 
most often speaks of "randomly selecting" a node but does not describe how exactly this 
should be done. 

A good method for doing so could select all nodes c and n in the tree t 
with exactly the same probability, as done by the method "uniformSelectNode", i.e., 
$P(\mathrm{uniformSelectNode}(t) = c) = P(\mathrm{uniformSelectNode}(t) = n) \; \forall c, n \in t$. 

15 Linear Genetic Programming is discussed in Section 4.6 on page 191 and a summary on PADO 
can be found in Section 4.7.1 on page 196. 
Therefore, we define the weight nodeWeight(n) of a tree node n to be the total number 
of nodes in the sub-tree with n as root, i.e., itself, its children, grandchildren, 
great-grandchildren, etc. 

$$\mathrm{nodeWeight}(n) = 1 + \sum_{i=0}^{\mathrm{len}(n.\mathrm{children}) - 1} \mathrm{nodeWeight}(n.\mathrm{children}[i]) \qquad (4.1)$$ 

Thus, the nodeWeight of the root of a tree is the number of all nodes in the tree and 
the nodeWeight of each of the leaves is exactly 1. In uniformSelectNode, the probability for 
a node of being selected in a tree t is thus 1/nodeWeight(t). We can create such a probability 
distribution by descending the tree from the root according to Algorithm 4.1. 



Algorithm 4.1: n ← uniformSelectNode(t) 

    Input: t: the (root of the) tree to select a node from 
    Data: c: the currently investigated node 
    Data: c.children: the list of child nodes of c 
    Data: b, d: two Boolean variables 
    Data: r: a value uniformly distributed in [0, nodeWeight(c)] 
    Data: i: an index 
    Output: n: the selected node 

    begin 
      b ← true 
      c ← t 
      while b do 
        r ← ⌊random_u(0, nodeWeight(c))⌋ 
        if r ≥ nodeWeight(c) - 1 then b ← false 
        else 
          i ← len(c.children) - 1 
          while i ≥ 0 do 
            r ← r - nodeWeight(c.children[i]) 
            if r < 0 then 
              c ← c.children[i] 
              i ← -1 
            else 
              i ← i - 1 
      return c 
    end 
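
A direct transcription of Algorithm 4.1 into Python might look as follows. It assumes trees stored as nested tuples `(op, child, ...)` and recomputes the node weights on every step for brevity; a practical implementation would cache them. The sampling experiment at the end is only a sanity check added here.

```python
import random
from collections import Counter

def children(node):
    """Child nodes of a node; leaves (plain strings) have none."""
    return node[1:] if isinstance(node, tuple) else ()

def node_weight(node):
    """Number of nodes in the sub-tree rooted at node (Equation 4.1)."""
    return 1 + sum(node_weight(child) for child in children(node))

def uniform_select_node(t, rng):
    """Select every node of the tree t with probability 1/nodeWeight(t)."""
    c = t
    while True:
        weight = node_weight(c)
        r = rng.randrange(weight)          # integer in 0 .. weight - 1
        if r >= weight - 1:
            return c                       # stop here with probability 1/weight
        # Otherwise descend into a child, chosen proportionally to its weight.
        for child in reversed(children(c)):
            r -= node_weight(child)
            if r < 0:
                c = child
                break

tree = ("+", ("*", "a", "b"), "c")         # five distinct nodes
rng = random.Random(5)
counts = Counter(uniform_select_node(tree, rng) for _ in range(20000))
for node, count in counts.items():
    print(node, count / 20000)
```

Each of the five nodes should be chosen in roughly 20% of the draws, regardless of how deep it sits in the tree.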



A tree descent with probabilities different from those defined here will lead to 
unbalanced node selection probability distributions. Then, the reproduction operators will 
prefer accessing some parts of the trees while very rarely altering the other regions. We could, 
for example, descend the tree by starting at the root t and would return the current node 
with probability 0.5 or recursively go to one of its children (also with 50% probability). Then, 
the root t would have a 50:50 chance of being the starting point of a reproduction operation. 
Its direct children have at most probability $0.5^2 / \mathrm{len}(t.\mathrm{children})$ each, their children even 
$0.5^3 / (\mathrm{len}(t.\mathrm{children}) \cdot \mathrm{len}(t.\mathrm{children}[i].\mathrm{children}))$, and so on. Hence, the leaves would almost never 
actively take part in reproduction. We could also choose other probabilities which strongly prefer 
going down to the children of the tree, but then, the nodes near to the root will most likely 
be left untouched during reproduction. Often, this approach is favored by selection methods, 
although leaves in different branches of the tree are not chosen with the same probabilities 
if the branches differ in depth. When applying Algorithm 4.1, on the other hand, there exist 
no regions in the trees that have lower selection probabilities than others. 
4.4 Genotype-Phenotype Mappings 

Genotype-phenotype mappings (GPM, see Section 3.8 on page 155) are used in many different 
Genetic Programming approaches. Here we give a few examples of them. Many of 
the Grammar-guided Genetic Programming approaches discussed in Section 4.5 on page 176 
are based on similar mappings. 

4.4.1 Cramer's Genetic Programming 

It is interesting to see that the earliest Genetic Programming approaches were based on a 
genotype-phenotype mapping. One of them, dating back to 1985, is the method of Cramer 
[462]. His goal was to evolve programs in a modified subset of the programming language 
PL. Two simple examples for such programs, obtained from his work, are: 

;; Set variable V0 to have the value of V1 
(:ZERO V0) 
(:LOOP V1 (:INC V0)) 

;; Multiply V3 by V4 and store the result in V5 
(:ZERO V5) 
(:LOOP V3 (:LOOP V4 (:INC V5))) 
Listing 4.1: Two examples for the PL dialect used by Cramer for Genetic Programming 

On the basis of a genetic algorithm working on integer strings, he proposed two ideas on how 
to convert these strings to valid program trees. 

The JB Mapping 

The first approach was to divide the integer string into tuples of a fixed length which is large 
enough to hold the information required to encode an arbitrary instruction. In the case of our 
examples, these are triplets where the first item identifies the operation, and the following 
two numbers define its parameters. Superfluous information, like a second parameter for a 
unary operation, is ignored. 

(0 4 2)  → (:BLOCK AS4 AS2) 
(1 6 0)  → (:LOOP V6 AS0) 
(2 1 9)  → (:SET V1 V9) 
(3 17 8) → (:ZERO V17) ;; the 8 is ignored 
(4 0 5)  → (:INC V0) ;; the 5 is ignored 

Listing 4.2: An example for the JB Mapping 

Here, the symbols of the form Vn and ASn represent variables and auxiliary statements, 
respectively. Cramer distinguishes between input variables providing data to a program and 
local (body) variables used for computation. Any of them can be chosen as output variable 
at the end of the execution. The multiplication program used in Listing 4.1 can now be 
encoded as (0 0 1 3 5 8 1 3 2 1 4 3 4 5 9 9 2), which translates to 

(0 0 1) ;; main statement        → (:BLOCK AS0 AS1) 
(3 5 8) ;; auxiliary statement 0 → (:ZERO V5) 
(1 3 2) ;; auxiliary statement 1 → (:LOOP V3 AS2) 
(1 4 3) ;; auxiliary statement 2 → (:LOOP V4 AS3) 
(4 5 9) ;; auxiliary statement 3 → (:INC V5) 

Listing 4.3: Another example for the JB Mapping 
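
The decoding shown in Listings 4.2 and 4.3 can be sketched in Python as below. Several details are assumptions made for this illustration rather than facts taken from Cramer's work: the first triplet is treated as the main statement, auxiliary statement k is the triplet at position k + 1, trailing integers that do not fill a triplet are dropped, and cyclic or out-of-range statement references raise an error.

```python
def decode_jb(genome):
    """Translate an integer string into a program text via the JB mapping."""
    count = len(genome) // 3               # incomplete trailing triplets are dropped
    triples = [tuple(genome[3 * i:3 * i + 3]) for i in range(count)]

    def statement(triple, active):
        op, p, q = triple
        if op == 0:
            return f"(:BLOCK {aux(p, active)} {aux(q, active)})"
        if op == 1:
            return f"(:LOOP V{p} {aux(q, active)})"
        if op == 2:
            return f"(:SET V{p} V{q})"
        if op == 3:
            return f"(:ZERO V{p})"         # second parameter is ignored
        if op == 4:
            return f"(:INC V{p})"          # second parameter is ignored
        raise ValueError(f"unknown operation code {op}")

    def aux(k, active):
        """Expand auxiliary statement k; `active` guards against cycles."""
        if k in active or k + 1 >= len(triples):
            raise ValueError(f"invalid reference to auxiliary statement {k}")
        return statement(triples[k + 1], active | {k})

    return statement(triples[0], frozenset())

genome = [0, 0, 1, 3, 5, 8, 1, 3, 2, 1, 4, 3, 4, 5, 9, 9, 2]
print(decode_jb(genome))
```

The sketch also makes the positional epistasis visible: inserting or removing a single triplet shifts the meaning of every ASn reference that follows it.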

Cramer outlines some of the major problems of this representation, especially the strong 
positional epistasis 16 - the strong relation of the meaning of an instruction to its position. 
This epistasis makes it very hard for the genetic operations to work efficiently, i.e., to prevent 
destruction of the genotypes passed to them. 



16 We come back to positional epistasis in Section 4.8.1 on page 202. 



The TB Mapping 

The TB mapping is essentially the same as the JB mapping, but reduces these problems a 
bit. Instead of using the auxiliary statement method as done in JB, the expressions in the 
TB language are decoded recursively. The string (0 (3 5) (1 3 (1 4 (4 5))) ), for instance, 
expands to the program tree illustrated in Listing 4.3. Furthermore, Cramer restricts mu- 
tation to the statements near the fringe of the tree, more specifically, to leaf operators that 
do not require statements as arguments and to non-leaf operations with leaf statements as 
arguments. Similar restrictions apply to crossover. 
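
The recursive decoding of the TB mapping, together with a tiny interpreter for the resulting programs, can be sketched as follows. The nested-tuple input format, the dictionary used as variable store, and the choice to read the loop counter once before the loop starts are assumptions made for this illustration.

```python
def decode_tb(expr):
    """Recursively translate a nested TB expression into program text."""
    op, args = expr[0], expr[1:]
    if op == 0:
        return "(:BLOCK " + " ".join(decode_tb(a) for a in args) + ")"
    if op == 1:
        return f"(:LOOP V{args[0]} {decode_tb(args[1])})"
    if op == 2:
        return f"(:SET V{args[0]} V{args[1]})"
    if op == 3:
        return f"(:ZERO V{args[0]})"
    if op == 4:
        return f"(:INC V{args[0]})"
    raise ValueError(f"unknown operation code {op}")

def run_tb(expr, v):
    """Execute a nested TB expression on the variable store v (index -> value)."""
    op, args = expr[0], expr[1:]
    if op == 0:
        for statement in args:
            run_tb(statement, v)
    elif op == 1:
        for _ in range(v[args[0]]):        # loop count is read once, up front
            run_tb(args[1], v)
    elif op == 2:
        v[args[0]] = v[args[1]]
    elif op == 3:
        v[args[0]] = 0
    elif op == 4:
        v[args[0]] += 1
    else:
        raise ValueError(f"unknown operation code {op}")

# (0 (3 5) (1 3 (1 4 (4 5)))) from the text
program = (0, (3, 5), (1, 3, (1, 4, (4, 5))))
print(decode_tb(program))

variables = {3: 6, 4: 7, 5: 99}
run_tb(program, variables)
print(variables[5])
```

Because sub-statements are nested directly inside their parents, moving a sub-expression moves its whole meaning with it, which is the reduction of positional effects the text refers to.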

4.4.2 Binary Genetic Programming 

With their Binary Genetic Programming (BGP) approach [136], Keller and Banzhaf [1119, 
1120, 1121] further explore the utility of explicit genotype-phenotype mappings and neutral 
variations in the genotypes. They called the genes in their fixed-length binary string genome 
codons analogously to molecular biology where a codon is a triplet of nucleic acids in the 
DNA 17 , encoding one amino acid at most. Each codon corresponds to one symbol in the 
target language. The translation of the binary string genotype g into a string representing 
an expression in the target language works as follows: 

1. x ← ε

2. Take the next gene (codon) g from g and translate it to the according symbol s.

3. If s is a valid continuation of x, set x ← x ∘ s and continue at step 2.

4. Otherwise, compute the set of symbols S that would be valid continuations of x.

5. From this set, extract the set of (valid) symbols S' which have the minimal Hamming
distance 18 to the codon g.

6. From S' take the symbol s' which has the minimal codon value and append it to x:
x ← x ∘ s'.

After this mapping, x can still be an invalid expression: if there were not enough genes
in g, the phenotype is incomplete, for example x = 3 * 4 - sin(v *. Such incomplete
sequences are fixed by consecutively appending symbols that lead to a quick end
of an expression according to some heuristic.
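The six steps above can be sketched as follows. This is only an illustration under strong assumptions: the 3-bit codon table SYMBOLS, the toy grammar (operands and operators must alternate), the repair heuristic (append the operand a), and all function names are made up for this example and are not taken from the BGP papers.

```python
# Toy symbol table: 3-bit codon value -> symbol (an assumption for illustration).
SYMBOLS = {0: "a", 1: "b", 2: "+", 3: "-", 4: "*", 5: "/", 6: "1", 7: "2"}
OPERATORS = {"+", "-", "*", "/"}


def hamming(u, v):
    """Number of differing bits between two codon values."""
    return bin(u ^ v).count("1")


def is_valid_continuation(x, s):
    """Toy grammar: operands and operators alternate, starting with an operand."""
    expects_operand = len(x) % 2 == 0
    return (s not in OPERATORS) == expects_operand


def bgp_map(codons, filler="a"):
    x = []                                    # step 1: x is the empty string
    for g in codons:                          # step 2: take the next codon
        s = SYMBOLS[g]
        if not is_valid_continuation(x, s):   # steps 4 to 6: correct the codon
            valid = [c for c, t in SYMBOLS.items() if is_valid_continuation(x, t)]
            best = min(hamming(c, g) for c in valid)
            s = SYMBOLS[min(c for c in valid if hamming(c, g) == best)]
        x.append(s)                           # step 3 (or 6): append the symbol
    if len(x) % 2 == 0:                       # incomplete: empty or ends with an operator
        x.append(filler)                      # assumed repair heuristic
    return " ".join(x)


print(bgp_map([0, 2, 3, 6, 4]))
```

In this toy setting, the genotypes (0, 2, 1) and (0, 2, 3) are both translated to a + b, since the invalid codon 3 is replaced by the nearest valid codon 1.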

The genotype-phenotype mapping of Binary Genetic Programming represents an n:1
relation: Due to the fact that different codons may be replaced by the same approximation, 
multiple genotypes have the same phenotypic representation. This also means that there can 
be genetic variations induced by the mutation operation that do not influence the fitness. 
Such neutral variations are often considered as a driving force behind (molecular) evolution 
[1137, 1138, 973] and are discussed in Section 1.4.5 on page 67 in detail. 

From the form of the genome we assume the number of corrections needed in the 
genotype-phenotype mapping (especially for larger grammars) will be high. This, in turn,
could lead to very destructive mutation and crossover operations since if one codon is mod- 
ified, the semantics of many subsequent codons may be influenced wildly. This issue is also 
discussed in Section 4.8.1 on page 204. 

4.4.3 Gene Expression Programming 

Gene Expression Programming (GEP) by Ferreira [654, 655, 656, 657, 658] introduces an 
interesting method for dealing with remaining unsatisfied function arguments at the end 
of the expression tree building process. Like BGP, Gene Expression Programming uses a 
genotype-phenotype mapping that translates fixed-length string chromosomes into tree phe- 
notypes representing programs. 



17 See Figure 1.14 on page 42 for more information on the DNA.
18 See Definition 29.6 on page 537.
A gene in GEP is composed of a head and a tail [654] which are further divided into
codons, where each codon directly encodes one expression. The codons in the head of a 
gene can represent arbitrary expressions whereas the codons in the tail can only stand 
for parameterless terms. This makes the tail a reservoir for unresolved arguments of the 
expressions in the head. 

For each problem, the length h of the head is chosen as a fixed value, and the length of the 
tail t is defined according to Equation 4.2, where n is the arity (the number of arguments) 
of the function with the most arguments. 

t = h(n - 1) + 1 (4.2) 

The reason for this formula is that we have h expressions in the head, each of them
taking at most n parameters. An upper bound for the total number of arguments is thus
h * n. From this number, h - 1 are already satisfied since all expressions in the head (except
for the first one) themselves are arguments to expressions instantiated before. This leaves at
most h * n - (h - 1) = h * n - h + 1 = h(n - 1) + 1 unsatisfied parameters. With this simple
measure, incomplete expressions that require additional repair operations in BGP and most
other approaches simply cannot occur.

For instance, consider the grammar for mathematical expressions with the terminal symbols
Σ = {√, *, /, -, +, a, b} given as example in [654]. It includes two variables, a and b,
as well as five mathematical functions, √, *, /, +, and -. √ has the arity 1 since it takes
one argument, the other four have arity 2. Hence, n = 2.

Figure 4.14: A GPM example for Gene Expression Programming.



Figure 4.14 illustrates an example gene (with h = 10 and t = h(2 - 1) + 1 = 11) and its
phenotypic representation for this mathematical expression grammar. A phenotype is built
by interpreting the gene as a level-order traversal 19 of the nodes of the expression tree. In
other words, the first codon of a gene encodes the root r of the expression tree (here +). Then,
all nodes in the first level (i. e., the children of r, here √ and -) are stored from left to
right, then their children and so on. In the phenotypic representation, we have sketched the
traversal order and numbered the levels. These level numbers are annotated to the gene but
are part of neither the real phenotype nor the genotype. Furthermore, the division of the
gene into head and tail is shown. In the head, the mathematical expressions as well as the
variables may occur, while variables are the sole construction element of the tail.

19 http://en.wikipedia.org/wiki/Tree_traversal [accessed 2007-07-15]
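The level-order decoding of a single gene can be sketched as follows. This is a minimal sketch under assumptions: the gene used here is a made-up example with h = 3 (it is not the gene of Figure 4.14), the symbol Q stands for the square root, and the names ARITY, tail_length, decode, and to_infix are chosen for this illustration only.

```python
from collections import deque

# Arity of each function symbol; every other symbol is a variable (arity 0).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}  # Q = square root


def tail_length(h, n):
    """Tail length according to Equation 4.2."""
    return h * (n - 1) + 1


def decode(gene):
    """Build the expression tree by reading the gene in level order."""
    symbols = iter(gene)
    root = [next(symbols)]            # a node is [symbol, child, child, ...]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols)]   # the next unread codon becomes the next child
            node.append(child)
            queue.append(child)
    return root


def to_infix(node):
    symbol, children = node[0], node[1:]
    if not children:
        return symbol
    if len(children) == 1:
        return f"sqrt({to_infix(children[0])})"
    return f"({to_infix(children[0])} {symbol} {to_infix(children[1])})"


head = ["+", "Q", "-"]                          # h = 3
tail = ["a", "b", "a", "b"]                     # t = 3 * (2 - 1) + 1 = 4
assert len(tail) == tail_length(len(head), 2)
print(to_infix(decode(head + tail)))            # (sqrt(a) + (b - a))
```

The last tail codon is never read in this example. In the worst case, where the head consists of binary functions only, all t tail codons are consumed, which is exactly the bound derived for Equation 4.2.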

In GEP, multiple genes form one genotype, thus encoding multiple expression trees. 
These trees may then be combined to one phenotype by predefined statements. It is easy 
to see that binary or integer strings can be used as genome, because the number of allowed 
symbols is known in advance. 

This fixed mapping is also a disadvantage of Gene Expression Programming in com- 
parison with the methods introduced later which have variable input grammars. On the 
other hand, there is the advantage that all genotypes can be translated to valid expression 
trees without requiring any corrections. Another benefit is that it seems to circumvent - 
at least partially - the problem of low causality from which the string-to-tree-GPM based 
approaches often suffer. By modularizing the genotypes, potentially harmful influences
of the reproduction operations are confined to single genes while others may stay intact. 
(See Section 4.8.1 on page 204 for more details.) 

General Information 

Areas Of Application 

Some example areas of application of Gene Expression Programming are: 



Application 

Symbolic Regression and Function Synthesis 

Data Mining and Data Analysis 

Electrical Engineering and Circuit Design 
Machine Learning 
Geometry and Physics 
Online Resources

Some general, online available resources on Gene Expression Programming are:

http://www.gene-expression-programming.com/ [accessed 2007-08-19]
Last update: up-to-date
Description: Gene Expression Programming Website. Includes publications, tutorials, and
software.



4.4.4 Edge Encoding 

Up until now, we only have considered how string genotypes can be transformed to more 
complex structures like trees. Obviously, genotype-phenotype mappings are not limited to
this, but can work on tree genotypes as well. In [1321], Luke and Spector present their 
edge encoding approach where the genotypes are trees (or forests) of expressions from a 
graph-definition language. During the GPM, these trees are interpreted and construct the 
phenotypes, arbitrary directed graphs. Edge encoding is closely related to Gruau's cellular 
encoding [863], which works on nodes instead of edges. 
The functions and terminals in edge encoding work on tuples (a, b) containing two node
identifiers. Such a tuple represents a directed edge from node a to node b. The functions 
edit these tuples, add nodes or edges and thus, successively build the graph. Unlike normal 
Genetic Programming applications like symbolic regression, for instance, the nodes of trees 
in edge encoding are "executed" from top to bottom (pre-order) and pass control down to 
their children (from left to right). After an edge has been processed by a terminal node, 
it becomes permanent part of the graph constructed. In order to allow the construction of 
arbitrary graphs, an additional control structure, a stack of node identifiers, is used. Each 
node in the GP tree may copy this stack, modify this copy, and pass it to all of its children. 

In their paper [1321], Luke and Spector give multiple possible function and terminal sets
for edge encoding. We provide a set that is sufficient to build arbitrary graphs in Table 4.1. 
Generally, each node receives an input edge tuple E = (a, b) and a stack s which it can 
process. The two commands labelE and labelN in the table are no real functions but just 
here to demonstrate how nodes and edges can be enriched with labels and other sorts of 
information. 
                 create an edge F = (b, c)
                 pass E = (a, b) and the stack s to the first child
                 pass F = (b, c) and the stack s to the second child

                 change edge E = (a, b) to E = (a, c)

                 create a new edge F = (b, b)
                 pass E = (a, b) and the stack s to the first child
                 pass F = (b, b) and the stack s to the second child

cut              eliminate edge E = (a, b)

                 create two new edges F = (a, c) and G = (b, c)
              5. pass E = (a, b) and the new stack s' to the first child
              6. pass F = (a, c) and the new stack s' to the second child
              7. pass G = (b, c) and the new stack s' to the third child

labelN   1    1. label node b from edge E = (a, b) with something
              2. pass E = (a, b) and the stack s to the child

labelE   1    1. label the edge E = (a, b) with something
              2. pass E = (a, b) and the stack s to the child



Table 4.1: One possible operator set of edge encoding. 



In Figure 4.15, an example genotype for edge encoding is given. The nodes of this genotype
are annotated with the parameters which are passed to them by their parent. The root
receives an initial node tuple and an empty stack. Notice that we have replaced the node
names a, b, c from Table 4.1 with running numbers and that new edges automatically receive
a new name. At the bottom of the graphic, you find the result of the interpretation of the
genotype by the GPM, a beautiful graph.

Figure 4.15: An example for edge encoding.

Edge encoding can easily be extended with automatically defined functions (as also shown 
in [1321]) and gave the inspiration for Sinclair's node pair encoding method for evolving 
network topologies [1887] (discussed in ?? on page ??). Vaguely related to such a graph 
generating approach are some of the methods for deriving electronic circuits by Lohn et al. 
[1306] and Koza et al. [1205] which you can find listed in the "Applications" tables in the 
general information sections. 



4.5 Grammars in Genetic Programming 
4.5.1 Introduction 

We have learned that the most common genotypic and phenotypic representations in Genetic
Programming are trees and also have discussed the reproduction operations available for
tree-based genomes. In this discussion, we left out one important point: in many applications,
reproduction cannot occur freely. Normally, there are certain restrictions to the structure
and shape of the trees that must not be violated. Take our pet-example symbolic regression 20
for instance. If we have a node representing a division operation, it will take two arguments:
the dividend and the divisor. One argument is not enough and a third argument is useless,
as one can easily see in Figure 4.16.



Figure 4.16: Example for valid and invalid trees in symbolic regression.



There are four general ways to avoid invalid configurations under these limitations:

1. Compensate illegal configurations during the evaluation of the objective functions. This 
would mean, for example, that a division with no arguments could return 1, a division 
with only the single argument a could return a, and that superfluous arguments (like c 
in Figure 4.16) would simply be ignored. 

2. A subsequent repair algorithm could correct errors in the tree structure that have been 
introduced during reproduction. 

3. Using additional checking and refined node selection algorithms, we can ensure that only 
valid trees are created during the reproduction cycle. 

4. With special genotype-phenotype mappings, we can prevent the creation of invalid trees 
from the start. 

In this section, we will introduce some general methods of enforcing valid configurations 
in the phenotypes, mostly regarding the fourth approach. A very natural way to express 
structural and semantic restrictions of a search space are formal grammars which are elab- 
orated on in Section 30.3 on page 561. Genetic Programming approaches that limit their 
phenotypes (the trees) to sentences of a formal language are subsumed under the topic of 
Grammar-guided Genetic Programming (GGGP, G3P) [1382]. 



4.5.2 Trivial Approach 

Standard Genetic Programming as introduced by Koza [1196] already inherently utilizes 
simple mechanisms to ensure the correctness of the tree structures. These mechanisms are 
rather trivial, though, and should not be counted to the family of GGGP approaches, but 
are mentioned here for the sake of completeness. 

In Standard Genetic Programming, all expressions have exactly the same type. Applied
to symbolic regression, this means that, for instance, all constructs will be real-valued or
return real values. If logical functions like multiplexers are grown, all entities will be
Boolean-valued, and so on. For each possible tree node type, we just need to specify the exact
number of children. This approach corresponds to a context-free grammar 21 with a single
non-terminal symbol which is expanded by multiple rules. Listing 4.4 illustrates such a trivial
grammar G = (N, Σ, P, S) in Backus-Naur Form (BNF) 22 . Here, the non-terminal symbol

20 See Section 23.1 on page 397. 

21 see Section 30.3.2 on page 563 for details 

22 The Backus-Naur form is discussed in Section 30.3.4 on page 564. 
<Z> ::= (<Z> + <Z>)
<Z> ::= (<Z> - <Z>)
<Z> ::= (<Z> * <Z>)
<Z> ::= (<Z> / <Z>)
<Z> ::= (sin <Z>)
<Z> ::= X

Listing 4.4: A trivial symbolic regression grammar. 

is Z (N = {Z}), the terminal symbols are Σ = {(, ), +, -, *, /, sin, X}, and six different
productions are defined. The start symbol is S = Z.

Standard Genetic Programming does not utilize such grammars directly. Rather, they are 
hard-coded in the reproduction operators or are represented in fixed internal data structures. 

Here we should mention that illegal configurations can also arise at runtime from semantics.
In symbolic regression, a division operation is invalid if the divisor is zero, for instance.
The same goes for logarithms, or a tangent of (n + 1/2)π ∀n ∈ ℤ. None of the four approaches
for enforcing a proper tree structure introduced previously can prevent such errors from the
start. Therefore, the function set (the possible inner nodes of the trees) needs to ensure the
property of closure as defined by Koza [1196].

Definition 4.1 (Closure). If a function set N has the property closure, it ensures that all 
possible values are accepted as parameter by any function. 

Closure is especially important in approaches like symbolic regression, and can easily be
achieved by redefining the mathematical functions for special cases, like setting a/0 = a ∀a ∈ ℝ,
for instance. It does, however, not consider the tree structure itself - the number of
arguments still needs to be sufficient.
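Such redefinitions are often called protected functions. A minimal sketch: protected_div implements the rule a/0 = a mentioned above, whereas the definition chosen for protected_log (using the absolute value and returning 0 for the argument 0) is only one common convention and an assumption of this example.

```python
import math


def protected_div(a, b):
    """Division that accepts every real divisor: a / 0 is defined as a."""
    return a if b == 0 else a / b


def protected_log(a):
    """Logarithm that accepts every real argument (assumed convention)."""
    return 0.0 if a == 0 else math.log(abs(a))


print(protected_div(3.0, 0.0))  # 3.0
print(protected_div(6.0, 3.0))  # 2.0
```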

4.5.3 Strongly Typed Genetic Programming 

The strongly typed Genetic Programming (STGP) approach developed by Montana [1446, 
1447, 1448] is still very close to Standard Genetic Programming. With strongly typed Genetic 
Programming, it becomes possible to use typed data structures and expressions in Genetic 
Programming. Hence, the issue of well-typedness arises, as illustrated in Figure 4.17. 

Figure 4.17: Example for valid and invalid trees in typed Genetic Programming.



As already mentioned in Section 4.5.2 on the previous page, in Standard Genetic Pro- 
gramming such errors are circumvented by only using representations that are type-safe per 
definition. In symbolic regression, for instance, only functions and variables which are real- 
typed are allowed, and in the evolution of logic functions only Boolean-valued expressions 
will be admitted. Thus, inconsistencies like in Figure 4.17 are impossible. 

In STGP, a tree genome is used which permits different data types that are not
assignment-compatible. One should not mistake STGP for a fully grammar-guided approach,
since its rules are still based on an implicit, hard-coded internal grammar and are built
in the bootstrap phase of the GP system. However, it clearly represents a method to shape
the individuals according to some validity constraints.

These constraints are realized by modified reproduction operations that use type
possibilities tables which denote which types for expressions are allowed in which level of a
tree (individual). The creation and mutation operators now return valid individuals by default.
Recombination still selects the node to be replaced in the first parent randomly, but the
sub-tree in the second parent which should replace this node is selected in a way that
ensures that the types match. If this is not possible, recombination either returns the parents
or an empty set.
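The type-matching step of recombination can be sketched as follows. This is a strongly simplified illustration and not Montana's implementation: it only compares return types and ignores the type possibilities tables and depth limits. The names Node, subtrees, and typed_donor are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    rtype: str                       # return type of the expression rooted here
    children: list = field(default_factory=list)


def subtrees(node):
    """Yield the node itself and all nodes below it."""
    yield node
    for child in node.children:
        yield from subtrees(child)


def typed_donor(target, donor_root, rng):
    """Pick a sub-tree of the second parent whose type matches the node to replace."""
    candidates = [n for n in subtrees(donor_root) if n.rtype == target.rtype]
    return rng.choice(candidates) if candidates else None  # None: no valid crossover


second_parent = Node("if", "real", [
    Node("<", "bool", [Node("x", "real"), Node("1", "real")]),
    Node("x", "real"),
    Node("2", "real"),
])
to_replace = Node("and", "bool")     # node selected randomly in the first parent
print(typed_donor(to_replace, second_parent, random.Random(0)).label)
```

If typed_donor returns None, the caller would fall back to returning the parents unchanged or an empty set, as described above.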

STGP also introduces interesting new concepts like generic functions and data types, very 
much like in Ada or C [1448] and hierarchical type systems, comparable to object-oriented 
programming in their inheritance structure [910]. This way, STGP increases the reusability 
and modularity in GP which is needed for solving more complex problems [67, 1195]. 

4.5.4 Early Research in GGGP 

Research steps into grammatically driven program evolution can be traced to the early 
1990s where Antonisse [73] developed his Grammar-based Genetic Algorithm. As genome, 
he used character strings representing sentences in a formal language defined by a context- 
free grammar. Whenever crossover was to be performed, these strings were parsed into 
the derivation trees 23 of that grammar. Then, recombination was applied in the same way 
as in tree-based systems. This parsing was the drawback of the approach, leading to two 
major problems: First, it slows down the whole evolution since it is an expensive operation. 
Secondly, if the grammar is ambiguous, there may be more than one derivation tree for 
the same sentence [1382]. Antonisse's early example was succeeded by other researchers like 
Stefanski [1958], Roston [1763], and Mizoguchi et al. [1439]. 

In the mid-1990s [1382, 1785], more scientists began to concentrate on this topic. The
LOGENPRO system developed by Wong and Leung [2250, 2247, 2248, 2249, 2251, 2252, 2253]
used PROLOG Definite Clause Grammars to derive first-order logic programs. A GP system
proposed by Whigham [2201, 2202, 2203, 2205, 2204] applied context-free grammars
in order to generate populations of derivation trees. This method additionally had the
advantage that it allowed the user to bias the evolution into the direction of certain parts
of the grammar [2205]. Geyer-Schulz [795] derived a similar approach, differing mainly in
the initialization procedure [241, 1382], for learning rules for expert systems. The Genetic
Programming Kernel (GPK) by Horner [960] used tree-genomes where each genotype was a
derivation tree generated from a BNF definition.

4.5.5 Gads 1 

The Genetic Algorithm for Deriving Software 1 (Gads 1) by Paterson and Livesey [1620,
1621] is one of the basic research projects that paved the way for other, more sophisticated
approaches like Grammatical Evolution. Like the Binary Genetic Programming system by
Keller and Banzhaf [1119], it uses a clear distinction between the search space G and the
problem space X. The genotypes g ∈ G in Gads are fixed-length integer strings which are
transformed to character string phenotypes x ∈ X (representing program syntax trees) by a
genotype-phenotype mapping (see Section 3.8 on page 155). Because of this genome, Gads
can use a conventional genetic algorithm engine 24 to evolve the solution candidates.

Gads receives a context-free grammar G = (N, Σ, P, S) specified in Backus-Naur form
as input. In Binary Genetic Programming, the genome encodes the sequence of terminal
symbols of the grammar directly. Here, a genotype specifies the sequence of the productions
to be applied to build a sentence of terminal symbols.

23 An elaboration on derivation trees can be found in Section 30.3.3 on page 563. 

24 Gads 1 uses the genetic algorithm C++ class library GAGS (http://geneura.ugr.es/GAGS/
[accessed 2007-07-09]) release 0.95.
(6)  <op>     ::= /
(7)  <op>     ::= *
(8)  <pre-op> ::= log
(9)  <pre-op> ::= tan
(10) <pre-op> ::= sin
(11) <pre-op> ::= cos
(12) <var>    ::= X
(13) <func>   ::= double func (double x){
                    return <expr>;
                  }

Listing 4.5: A simple grammar for C functions that could be used in Gads. 



Although Gads was primarily tested with LISP S-expressions, it can evolve sentences in
arbitrary BNF grammars. For the sake of coherence with later sections, we use a grammar
for simple mathematical functions in C as example. Here, the set of possible terminals
is Σ = {sin, cos, tan, log, +, -, *, /, X, (), ...} and as non-terminal symbols we use
N = {expr, op, pre-op, func}. The starting symbol is S = func and the set of productions P is
illustrated in Listing 4.5.

In the BNF grammar definitions for Gads, the "|" symbol commonly denoting alternatives
is not used. Instead, multiple productions can be defined for the same non-terminal
symbol.

Every gene in a Gads genotype contains the index of the production in G to be applied
next. For now, let us investigate the genotype g = (2, 0, 12, 5, 5, 13, 10) as example. If the
predefined start symbol is func, we would start with the phenotype string x_1:

double func (double x){ 
return <expr>; 

} 

The first gene in g, 2, leads to the application of rule (2) to x_1 and we obtain x_2:

double func (double x){ 

return <pre-op> (<expr>); 

} 

The next gene is 0, which means that we will use production (0). There is a
(non-terminal) expr symbol in x_2, so we get x_3 as follows:

double func (double x){ 

return <pre-op> (<expr> <op> <expr>); 

} 

Now comes the next gene with allele 12 25 . We cannot apply rule (12) since no var symbol
can be found in x_3, so we simply ignore this gene and set x_4 = x_3. The following gene with
value 5 translates the symbol op to - and we obtain for x_5:

25 An allele is a value of a specific gene, see Definition 1.24 on page 43.
double func (double x){

return <pre-op> (<expr> - <expr>); 

} 

The next two genes, 5 and 13, must again be ignored (x_7 = x_6 = x_5). Finally, the last
gene with the allele 10 resolves the non-terminal pre-op and we get for x_8:

double func (double x){ 

return sin (<expr> - <expr>); 

} 

For the remaining two expr non-terminal symbols no rule is defined in the genotype g. 
There are several ways for dealing with such incomplete resolutions. One would be to exclude 
the individual from evaluation/simulation and to give it the lowest possible objective values 
directly. Gads instead uses simple default expansion rules. In this example, we could translate 
all remaining exprs to vars and these subsequently to X. This way we obtain the resulting 
function below. 

double func (double x){ 
return sin (X - X) ; 

} 
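The walk-through above can be reproduced with a short sketch. Several details are assumptions: productions (1) and (4) and the default expansions for op and pre-op are not used in the worked example and were filled in for illustration, a production is always applied to the leftmost matching non-terminal, and the names RULES, DEFAULTS, and gads_map are hypothetical.

```python
# Productions indexed as in the worked example: (left-hand side, right-hand side).
RULES = [
    ("<expr>", "<expr> <op> <expr>"),     # (0)
    ("<expr>", "(<expr> <op> <expr>)"),   # (1) assumed, not used in the example
    ("<expr>", "<pre-op> (<expr>)"),      # (2)
    ("<expr>", "<var>"),                  # (3)
    ("<op>", "+"),                        # (4) assumed, not used in the example
    ("<op>", "-"),                        # (5)
    ("<op>", "/"),                        # (6)
    ("<op>", "*"),                        # (7)
    ("<pre-op>", "log"),                  # (8)
    ("<pre-op>", "tan"),                  # (9)
    ("<pre-op>", "sin"),                  # (10)
    ("<pre-op>", "cos"),                  # (11)
    ("<var>", "X"),                       # (12)
    ("<func>", "double func (double x){ return <expr>; }"),  # (13)
]

# Default expansions in the order of application (op and pre-op are assumptions).
DEFAULTS = [("<expr>", "<var>"), ("<op>", "+"), ("<pre-op>", "sin"), ("<var>", "X")]


def gads_map(genotype):
    x = RULES[13][1]                      # x_1: the expanded start symbol func
    for gene in genotype:
        lhs, rhs = RULES[gene]
        if lhs in x:                      # otherwise the gene is an intron
            x = x.replace(lhs, rhs, 1)    # expand the leftmost occurrence only
    for lhs, rhs in DEFAULTS:             # resolve whatever is left over
        x = x.replace(lhs, rhs)
    return x


print(gads_map((2, 0, 12, 5, 5, 13, 10)))
```

For the example genotype, three of the seven genes (12, the second 5, and 13) are skipped as introns, and the two remaining expr symbols are resolved by the default rules.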

One of the problems in Gads is the unacceptably large number of introns 26 [1619] caused
by the encoding scheme. Many genes will not contribute to the structure of the phenotype 
since they encode productions that cannot be executed (like allele 12 in the example genotype 
g) because there are no matching non-terminal symbols. This is especially the case in "real- 
world" applications where the set of non-terminal symbols N becomes larger. 

With the Gads system, Paterson paved the way for many of the advanced techniques 
described in the following sections. 

4.5.6 Grammatical Evolution 

Like Gads, Grammatical Evolution 27 (GE), developed by Ryan et al. [1785], creates
expressions in a given language by iteratively applying the rules of a grammar specified in the
Backus-Naur form [1785, 1565, 1784].

In order to discuss how Grammatical Evolution works, we re-use the example of C-style 
mathematical functions [1785] from Section 4.5.5. Listing 4.6 specifies the according rules 
using a format which is more suitable for grammatical evolution. 

There are five rules in the set of productions P, labeled from A to E. Some of the rules
have different options (separated by |). In each rule, options are numbered starting with 0.
When the symbol <expr> is expanded, for example, there are four possible results
(0-3). The shape of the sentences produced by the grammar depends on these choices.

Like in Gads, the genotypes in GE are numerical strings. These strings encode the indices
of the options instead of the productions themselves. In Gads, each option was treated as
a single production because of the absence of the "|" operator. The idea of Grammatical
Evolution is that the rule to be used is already determined by the non-terminal
symbol to be expanded, so we only need to decide which option of this rule is to be applied.
Therefore, the number of introns is dramatically reduced compared to Gads.

The variable-length string genotypes of Grammatical Evolution can again be evolved 
using genetic algorithms [1785, 1783] (like in Gads) or with other techniques, like Par- 
ticle Swarm Optimization [1578, 1568] or Differential Evolution [1567]. As illustrated in 
Figure 4.18, a Grammatical Evolution system consists of three components: the problem 
definition (including the means of evaluating a solution candidate), the grammar that defines 
the possible shapes of the individuals, and the search algorithm that creates the individuals 
[1782]. 

26 Introns are genes or sequences of genes (in the genotype) that do not contribute to the phenotype 
or its behavior, see Definition 3.2 on page 146 and Section 4.10.3. 

27 http://en.wikipedia.org/wiki/Grammatical_evolution [accessed 2007-07-05] 
Listing 4.6: A simple grammar for C functions that could be used by GE.

Figure 4.18: The structure of a Grammatical Evolution system [1782].

An Example Individual

We get back to our mathematical C function example grammar in Listing 4.6. As already
said, a genotype g ∈ G is a variable-length string of numbers that denote the choices to be
taken whenever a non-terminal symbol from N is to be expanded and more than one option is
available (as in the productions (A), (B), and (C)). The start symbol, S = func, does not need
to be encoded since it is predefined. Rules with only one option do not consume information
from the genotype. The processing of non-terminal symbols uses a depth-first order [1785],
so resolving a non-terminal symbol ultimately to terminal symbols has precedence before
applying an expansion to a sibling.

Let us assume we have settled for bytes as genes in the genome. As we may have fewer
than 256 options, we apply modulo arithmetic to get the index of the option. This way, the
sequence g = (193, 47, 51, 6, 251, 88, 63) is a valid genotype. According to our grammar, the
first symbol to expand is S = func (rule (E)) where only one option is available. Therefore,
all phenotypes will start out like

double func (double x){ 
return <expr>; 

} 

The next production we have to check is (A), since it expands expr. This production has
four options, so taking the first number from the genotype g, we get 193 mod 4 = 1 which
means that we use option (1) and obtain
double func (double x){
return (<expr> <op> <expr>); 

} 

As expr appears again, we have to evaluate rule (A) once more. The next number, 47, 
gives us 47 mod 4 = 3 so option (3) is used. 

double func (double x){ 
return (<var> <op> <expr>); 

}

var is expanded by rule (D) where only one result is possible: 

double func (double x){ 
return (X <op> <expr>); 

}

Subsequently, op will be evaluated to * since 51 mod 4 = 3 (rule (B) (3)) and expr becomes 
pre-op(<expr>) because 6 mod 4 = 2 (production (A)(2)). Rule (C)(3) then turns pre-op 
into cos since 251 mod 4 = 3. expr is expanded to <expr> <op> <expr> by (A) (0) because 
88 mod 4 = 0. The last gene in our genotype is 63, and thus rule (A) (3) (63 mod 4 = 3) 
transforms expr to <var> which then becomes X. 

double func (double x){ 
return (X * cos (X <op> <expr>)); 

} 

By now, the numbers available in g are exhausted and we still have non-terminal symbols 
left in the program. As already outlined earlier, there are multiple possible approaches how 
to proceed in such a situation: 

1. Mark g as invalid and give it a reasonably bad fitness.

2. Expand the remaining non-terminals using default rules (i. e., we could say the default
value for expr is X and op becomes +).

3. Wrap around and restart taking numbers from the beginning of g.

The latter method is applied in Grammatical Evolution. It has the disadvantage that it
can possibly result in an endless loop in the genotype-phenotype translation, so there should
be a reasonable maximum for the iteration steps after which we fall back to default rules.

In the example, we proceed by expanding op according to (B)(1), since 193 mod 4 = 1,
and obtain - (minus). The next gene gives us 47 mod 4 = 3, so the last expr becomes a
<var>, and finally our phenotype is:

double func (double x){
return (X * cos (X - X));

}

Note that if the last gene 63 was missing in g, the "restart method" which we have just
described would produce an infinite loop, because the first non-terminal to be evaluated
whenever we restart taking numbers from the front of the genome then will always be expr.
In this example, we are lucky and this is not the case since, after wrapping at the genotype
end, an op is to be resolved. The gene 193 thus is an index into rule (A) at its first usage
and an index into production (B) in its second application.
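The complete mapping of the example genotype can be traced with the following sketch. It rests on assumptions: the grammar dictionary is reconstructed from the walk-through (the options op (0) and (2) and pre-op (0) to (2) are guesses in the order used for Gads), the wrap limit max_wraps and the default options are illustrative choices, and the names GRAMMAR, DEFAULT, and ge_map are hypothetical.

```python
import re

# Rules (A) to (E), reconstructed from the walk-through; see the assumptions above.
GRAMMAR = {
    "<expr>": ["<expr> <op> <expr>", "(<expr> <op> <expr>)", "<pre-op>(<expr>)", "<var>"],
    "<op>": ["+", "-", "/", "*"],
    "<pre-op>": ["log", "tan", "sin", "cos"],
    "<var>": ["X"],
    "<func>": ["double func (double x){ return <expr>; }"],
}
DEFAULT = {"<expr>": 3, "<op>": 0, "<pre-op>": 2}  # assumed fall-back options
NON_TERMINAL = re.compile(r"<[a-z-]+>")


def ge_map(genotype, max_wraps=2):
    x = "<func>"
    used = 0                                    # number of genes consumed so far
    limit = len(genotype) * (max_wraps + 1)     # guards against endless wrapping
    while True:
        match = NON_TERMINAL.search(x)          # leftmost symbol = depth-first order
        if match is None:
            return x
        options = GRAMMAR[match.group()]
        if len(options) == 1:
            choice = 0                          # consumes no gene
        elif used < limit:
            choice = genotype[used % len(genotype)] % len(options)  # wraps around
            used += 1
        else:
            choice = DEFAULT[match.group()]     # fall back to default rules
        x = x[:match.start()] + options[choice] + x[match.end():]


print(ge_map((193, 47, 51, 6, 251, 88, 63)))
```

Dropping the last gene 63 triggers the endless wrapping described above. In this sketch, the mapping then stops consuming genes after max_wraps wraps and finishes with the default options.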

Initialization 

Grammatical Evolution uses an approach for initialization similar to ramped half-and-half 28,
but on the basis of derivation trees 29. Therefore, the numbers of the choices made during a
random grammatical rule expansion beginning at the start symbol are recorded. Then, a
genotype is built by reversing the modulo operation, i.e., finding a number that produces
the same number as recorded when modulo-divided for each gene. The number of clones is
subsequently reduced and, optionally, the single-point individuals are deleted.

28 An initialization method of standard, tree-based Genetic Programming that creates a good
mixture of various tree shapes [1196], see Section 4.3.1 on page 163 for more details.

29 See Section 30.3.3 on page 563.
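
Reversing the modulo operation can be illustrated with a short Python sketch. It assumes genes in the range 0..255 and a list of recorded (choice, number of options) pairs. The names gene_for_choice, genotype_from_choices, and gene_max are hypothetical and only serve this example.

```python
import random


def gene_for_choice(choice, n_options, gene_max=255, rng=random):
    """Return a random gene v in 0..gene_max with v % n_options == choice."""
    if not 0 <= choice < n_options:
        raise ValueError("choice must be a valid option index")
    # Largest k such that choice + k * n_options still fits into a gene.
    k_max = (gene_max - choice) // n_options
    return choice + n_options * rng.randint(0, k_max)


def genotype_from_choices(recorded, gene_max=255, rng=random):
    """Build a genotype from recorded (choice, n_options) pairs."""
    return [gene_for_choice(c, n, gene_max, rng) for c, n in recorded]


# Choices recorded during a random expansion (illustrative values).
recorded = [(2, 4), (3, 4), (1, 4)]
genotype = genotype_from_choices(recorded)
# Decoding each gene with the modulo operation recovers the recorded choices.
print(genotype, [g % n for g, (_, n) in zip(genotype, recorded)])
```

Since many gene values map to the same option, the sketch picks one of them at random. This keeps the redundancy of the encoding in the initial population.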



General Information 

Areas Of Application 



Some example areas of application of Grammatical Evolution are: 



Application                                            References

Mathematical Problems                                  [1783, 1785]
  (vs. Standard Genetic Programming: [1196])
Automated Programming                                  [1784, 1569, 1566]
Robotics                                               [1576, 1575]
  (vs. Standard Genetic Programming: [1317, 1204])
Economics and Finance                                  [1577, 266, 264, 2...]
  (vs. Standard Genetic Programming: [1513, 1674])



There even exists an approach called "Grammatical Evolution by Grammatical Evolution"
((GE)², [1571]) where the grammar defining the structure of the solution candidates itself
is co-evolved with the individuals.



Conferences, Workshops, etc. 



Some conferences and workshops on Grammatical Evolution are:



GEWS: Grammatical Evolution Workshop
http://www.grammatical-evolution.com/gews.html [accessed 2007-09-10]
History: 2004: Seattle, WA, USA, see [1574]
         2003: Chicago, IL, USA, see [1573]
         2002: New York, NY, USA, see [1572]



Online Resources 

Some general, online available resources on Grammatical Evolution are:



http://www.grammatical-evolution.com/ [accessed 2007-07-05]
Last update: up-to-date
Description: Grammatical Evolution Website. Includes publications, links, and software.



Books 



Some books about (or including significant information about) Grammatical Evolution are: 

O'Neill and Ryan [1570]: Grammatical Evolution: Evolutionary Automatic Programming in
an Arbitrary Language

4.5.7 Gads 2

In Gads 2, Paterson [1619] uses the experiences from Gads 1 and the methods of the Gram- 
matical Evolution approach to tackle context-sensitive grammars with Genetic Program- 
ming. While context-free grammars are sufficient to describe the syntax of a programming 
language, they are not powerful enough to determine if a given source code is valid. Take 
for example the C snippet: 

char i;
i = 0.5;
y = 1;

This is obviously not a well-typed program although syntactically correct. Context-sensitive
grammars 30 allow productions like αAβ → αγβ where A ∈ N is a non-terminal
symbol, and α, β, γ ∈ V* are concatenations of arbitrarily many terminal and non-terminal
symbols (with the exception that γ ≠ ε, i.e., it must not be the empty string). Hence, it is
possible to specify with a context-sensitive grammar that a value assigned to a variable must
be of the same type as the variable and that the variable must have been declared previously.
Paterson argues that the application of existing approaches like two-level grammars and
standard attribute grammars 31 in Genetic Programming is infeasible [1619] and
introduces an approach based on reflective attribute grammars.

Definition 4.2 (Reflective Attribute Grammar). A reflective attribute grammar 
(rag 32 ) [1619] is a special form of attribute grammars. When expanding a non-terminal 
symbol with a rag production, the grammar itself is treated as an (inherited) attribute. 
During the expansion, it can be modified and is finally passed on to the next production 
step involving the newly created nodes. 

The transformation of a genotype g ∈ G into a phenotype using a reflective attribute
grammar r resembles Grammatical Evolution to some degree. Here we discuss it with the
example of the recursive expansion of the symbol s:

1. Write the symbol s to the output.

2. If s ∈ Σ, i.e., s is a terminal symbol, nothing else needs to be done: return.

3. Use the next gene in the genotype g to choose one of the alternative productions that
have s on their left-hand side. If g is exhausted, choose the default rule.

4. Create the list of the child symbols s₁...sₙ according to the right-hand side of the
production.

5. For i = 1 to n do

a) Resolve the i-th child symbol, passing in sᵢ, r, and g.

b) If needed, modify the grammar r according to the semantics of s and sᵢ.

Item 5 is the main difference between Gads 2 and Grammatical Evolution. What happens 
here depends on the semantics in the rag. For example, if a non-terminal symbol that declares 
a variable x is encountered, a new terminal symbol n is added to the alphabet Σ that
corresponds to the name of x. Additionally, the rule which expands the non-terminal symbol 
that stands for variables of the same type now is extended by a new option that returns n. 
Thus, the new variable becomes available in the subsequent code. 
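
The recursive expansion described above can be sketched in Python. This is a simplified illustration under several assumptions and not Paterson's implementation: the toy grammar R, the function names resolve, modify, and translate, the choice of the first option as default rule, and the interval-based gene decoding for byte-sized genes are all hypothetical. For brevity, only terminal symbols are written to the output, so the sketch produces a flat string instead of a tree.

```python
# Toy grammar: a program declares one variable and then performs one assignment.
R = {
    "prog": [["decl", "stmt"]],
    "decl": [["double ", "name", "=0;"]],
    "name": [["a"], ["b"]],
    "stmt": [["var", "=", "var", ";"]],
    "var": [["X"]],
}


def modify(parent, child, text, r):
    """Step 5b: semantic action; a declared name becomes a new option of var."""
    if parent == "decl" and child == "name":
        r = dict(r)                       # copy, the caller's grammar stays intact
        r["var"] = r["var"] + [[text]]
    return r


def resolve(s, r, genes, out):
    """Expand symbol s with grammar r and return the possibly modified grammar."""
    if s not in r:                        # steps 1 and 2: terminal symbol
        out.append(s)
        return r
    options = r[s]
    if genes:                             # step 3: next gene selects the option
        chosen = options[genes.pop(0) * len(options) // 256]
    else:                                 # genotype exhausted: default rule
        chosen = options[0]
    for child in chosen:                  # steps 4 and 5
        start = len(out)
        r = resolve(child, r, genes, out)                   # step 5a
        r = modify(s, child, "".join(out[start:]), r)       # step 5b
    return r


def translate(genes, r, start="prog"):
    out = []
    resolve(start, r, list(genes), out)
    return "".join(out)


print(translate([0, 0, 200, 0, 0, 255], R))
```

The grammar returned by each call is handed to the expansion of the next sibling. This is how a declaration made early in the program becomes visible in the statements that follow it.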

30 See Section 30.3.2 on page 563 where we discuss the Chomsky Hierarchy of grammars.

31 See Section 30.3.6 on page 565 for a discussion of attribute grammars.

32 Notice that the shortcut of this definition, rag, slightly collides with the one of Recursive
Adaptive Grammars (RAG) introduced by Shutt [1874] and discussed in Section 30.3.8 on
page 568, although their letter cases differ. To the knowledge of the author, rags are
exclusively used in Gads 2.

Another difference to Grammatical Evolution is the way the genes are used to select an
option in item 3. GE simply uses the modulo operation to make its choice. Assume we have
genotypes where each gene is a single byte and encounter a production with four options
while the next gene has the value 45. In Grammatical Evolution, this means to select the
second option since 45 mod 4 = 1 and we number the alternatives beginning with zero. Gads
2, on the other hand, will divide the range of possible alleles into four disjoint intervals of
(approximately) equal size [0, 63], [64, 127], [128, 191], [192, 255] where 45 falls clearly into the
first one. Thus, Gads 2 will expand the first rule.
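
The two ways of decoding a gene can be compared directly. The following Python sketch assumes byte-sized genes. The function names ge_select and gads2_select are hypothetical, and the integer division is only one possible way to realize the split into intervals of approximately equal size.

```python
def ge_select(gene, n_options):
    """Grammatical Evolution: option index via the modulo operation."""
    return gene % n_options


def gads2_select(gene, n_options, n_alleles=256):
    """Gads 2 style: split 0..n_alleles-1 into n_options consecutive intervals."""
    return gene * n_options // n_alleles


# The example from the text: gene value 45 and a production with four options.
print(ge_select(45, 4))      # 1, i.e., the second option
print(gads2_select(45, 4))   # 0, i.e., the first option
```

With the interval mapping, neighboring allele values usually select the same option, whereas with the modulo mapping they always select different ones.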

The advantage of Gads 2 is that it allows growing valid sentences according to
context-sensitive grammars. It becomes possible to generate source code that is not only
syntactically correct but also well-typed for most conventional programming languages. Its
major drawback is that it has not been realized fully. The additional semantics of the
production expansion rule 5b have not been specified in the grammar or in an additional
language as input for the Genetic Programming system but are only realized exemplarily in a
hard-coded manner for the programming language S-Algol [1461]. The experimental results in
[1619], although successful, do not provide substantial benefits compared to the simpler
Grammatical Evolution approach.

Gads 2 shows properties that we also experienced in the past: Even if constructs like loops, 
procedure calls, or indexed access to memory are available, the chance that they are actually 
used in the way in which we would like them to be used is slim. Genetic Programming of 
real algorithms in a high-level programming language-like syntax exhibits a high affinity to 
employ rather simple instructions while neglecting more powerful constructs. Good fitness 
values are often reached with overfitting only. 

Like Grammatical Evolution, the Gads 2 idea can be realized with arbitrary genetic 
algorithm engines. Paterson [1619] uses the Java-based evolutionary computation system 
ECJ by Luke et al. [1327] as genetic algorithm engine in his experiments. 

4.5.8 Christiansen Grammar Evolution 

Christiansen Grammars, which are described in Section 30.3.9 on page 569, have
many similarities to the reflective attribute grammars used in Gads 2. They are both
Extended Attribute Grammars 33 and the first attribute of both grammars is an inherited
instance of themselves. Christiansen Grammars are formalized and backed by comprehensive
research since being developed back in 1985 by Christiansen [402].

Building on their previous work, de la Cruz Echeandía et al. [520] place the idea of Gads
2 on the solid foundation of Christiansen Grammars with their Christiansen Grammar
Evolution approach (CGE) [521]. They tested their system for finding logic function identities
with constraints on the elementary functions to be used. Instead of elaborating on this
experiment, let us stick with the example of mathematical functions in C for the sake of
simplicity.

In Listing 4.7 we define the productions P of a Christiansen Grammar that extends the 
examples from before by the ability of creating and using local variables. Three new rules 
(F) , (G) , and (H) are added, and the existing ones have been extended with attributes. 

The non-terminal symbol expr now receives the inherited attribute g which is the
(Christiansen) grammar to be used for its expansion. The ↓ (arrow down) indicates inherited
attribute values that are passed down from the parent symbol, whereas ↑a (arrow up)
identifies an attribute value a synthesized during the expansion of a symbol and passed back to
the parent symbol.

The start symbol S is still func, but the corresponding production (E) has been
complemented by a reference to the new non-terminal symbol stmt (line 19). The symbol stmt
has two attributes: an inherited (input) grammar g0 and a synthesized (output) grammar
g2. We need to keep that in mind when discussing the options possible for its resolution. A
stmt symbol can either be expanded to two new stmts in option (0), to a variable declaration
represented by the non-terminal symbol new-var as option (1), or to a variable assignment
(symbol assign) in option (2). Most interesting here is option (1), the variable declaration.

33 See Section 30.3.7 on page 567 for more information on such grammars.

(A) <expr ↓g> ::= <expr ↓g> <op ↓g> <expr ↓g>            (0)
                | (<expr ↓g> <op ↓g> <expr ↓g>)          (1)
                | <pre-op ↓g> (<expr ↓g>)                (2)
                | <var ↓g>                               (3)

(B) <op ↓g> ::= "+"                                      (0)
              | "-"                                      (1)
              | "/"                                      (2)
              | "*"                                      (3)

(C) <pre-op ↓g> ::= "log"                                (0)
                  | "tan"                                (1)
                  | "sin"                                (2)
                  | "cos"                                (3)

(D) <var ↓g> ::= "X"                                     (0)

(E) <func ↓g1> ::= "double func(double x){"
                     <stmt ↓g1 ↑g2>
                     "return " <expr ↓g2> ";"
                   "}"                                   (0)

(F) <stmt ↓g0 ↑g2> ::= <stmt ↓g0 ↑g1> <stmt ↓g1 ↑g2>     (0)
                     | <new-var ↓g0 ↑g2>                 (1)
                     | <assign ↓g0 ↑g2>                  (2)

(G) <new-var ↓g ↑g+new-rule> ::=
      "double " <alpha-list ↓g ↑w> "=0;"                 (0)
      where <new-rule> is <var ↓g> ::= w

(H) <assign ↓g ↑g> ::= <var ↓g> "=" <expr ↓g> ";"        (0)

Listing 4.7: A Christiansen grammar for C functions that use variables.


The production for new-var, labeled (G), receives the grammar g as input. The synthesized
attribute it generates as output is g extended by a new rule new-rule. The name of the new
variable is a string over the Latin alphabet. In order to create this string, we make use of the
non-terminal symbol alpha-list defined in Listing 30.12 on page 569. alpha-list inherits a
grammar as first attribute, generates a character string w, and also synthesizes it as output.
Production (G) uses this value w in order to build its output grammar. It creates a new rule
(see line 29) which extends the production (D) by a new option. var can now be resolved
to either X or to one of the new variables in subsequent expansions of expr because the
synthesized grammar is passed up to stmt and from there to all subsequent statements (see
rule (F) option (0)) and is even used by the returned expression in line 20. It should be
mentioned that this example grammar does not prevent name collisions of the identifiers,
since X, for instance, is also a valid expansion of new-var.

With this grammar, a Christiansen Grammar Evolution system would proceed exactly 
as done in Section 4.5.6 on page 181. 
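
How the grammar attribute is threaded through a sequence of statements (g0, g1, g2 in rule (F)) can be made concrete with a small Python sketch. It is a strongly reduced model and no Christiansen Grammar implementation: a grammar is represented only by the list of options of var, statements are given as tuples, and the names declare and thread are hypothetical.

```python
def declare(g, name):
    """Rule (G): synthesize g + new-rule, where new-rule is <var> ::= name."""
    return {**g, "var": g["var"] + [name]}


def thread(statements, g0):
    """Pass the grammar from statement to statement as in rule (F), option (0).

    Returns (valid, g): valid is False as soon as an assignment uses a name
    that cannot be derived from var under the grammar inherited at that point.
    """
    g = g0
    for kind, *args in statements:
        if kind == "new-var":
            g = declare(g, args[0])           # output grammar differs from input
        elif kind == "assign":                # rule (H): grammar passes unchanged
            target, source = args
            if target not in g["var"] or source not in g["var"]:
                return False, g
        else:
            raise ValueError("unknown statement kind: " + kind)
    return True, g


g0 = {"var": ["X"]}
print(thread([("new-var", "a"), ("assign", "a", "X")], g0))
print(thread([("assign", "a", "X"), ("new-var", "a")], g0))
```

The second call fails because the assignment is expanded with a grammar that does not yet contain the variable a. As noted in the text, nothing in this scheme prevents declaring X a second time.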

4.5.9 Tree-Adjoining Grammar-guided Genetic Programming

A different approach to Grammar-guided Genetic Programming has been developed by
Nguyen [1525] with his Tree-Adjoining Grammar-guided Genetic Programming (TAG3P)
system [1526, 1529, 1527, 1528, 1530]. Instead of using grammars in the Backus-Naur Form
or one of its extensions as done in the aforementioned methods, it is based on tree-adjoining
grammars (TAGs) which are introduced in Section 30.3.10 on page 569.

An Example TAG grammar 

A tree-adjoining grammar can be defined as quintuple G = (N, Σ, A, I, S) where N contains
the non-terminal symbols, Σ contains the terminal symbols, and S is the start symbol. TAGs
support two basic operations: adjunction and substitution. For these operations, blueprint
trees are provided in the sets of auxiliary and initial trees, respectively (A and I).
Substitution is quite similar to expansion in BNF: the root of an initial tree replaces a leaf
with the same label in another tree. A tree β to be used for adjunction has at least one leaf
node v (usually marked with an asterisk *) with the same label as its root. It is injected into
another tree by replacing a node with (again) that label whose children are then attached to v.

Figure 4.19: A TAG realization of the C-grammar of Listing 4.6.



Let us take a look at the tree-adjoining representation of our earlier example grammar
G in Listing 4.6 on page 182 for mathematical functions in C. Figure 4.19 illustrates
one possible realization of G as TAG. The productions are divided into the set of initial
trees I, which are used in substitution operations, and the auxiliary trees A needed by the
adjunction operator. Again, the start symbol is func; this time, however, it identifies a tree
in I. We additionally have annotated the trees with the index of the corresponding rule in
Listing 4.6. It is possible that we need to build multiple TAG trees for one BNF rule, as
done with rule 1 which is reflected in the two auxiliary trees β₁ and β₂. The rules 3 and 12
on the other hand have been united into one initial tree for the purpose of simplicity (it
could have been done in the BNF in the same way).
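
The two TAG operations, adjunction and substitution, can be sketched on a simple tree data structure in Python. The class Node and the functions adjoin, substitute, and frontier are hypothetical names for this illustration. The small trees in the usage example are only loosely modeled on the expression grammar and are not the trees of Figure 4.19.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    foot: bool = False      # True for the leaf marked with * in an auxiliary tree


def find_foot(node):
    """Return the foot node of an auxiliary tree, or None if there is none."""
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(target, aux_root):
    """Adjoin a copy of the auxiliary tree at the node target (in place)."""
    aux = copy.deepcopy(aux_root)
    foot = find_foot(aux)
    if foot is None or not (target.label == aux.label == foot.label):
        raise ValueError("labels of target, root, and foot node must match")
    foot.children = target.children    # the old subtree now hangs below the foot
    foot.foot = False
    target.children = aux.children     # target takes the role of the aux root


def substitute(leaf, initial_root):
    """Replace a non-terminal leaf by a copy of an initial tree with that label."""
    if leaf.children or leaf.label != initial_root.label:
        raise ValueError("substitution needs a leaf with the same label")
    leaf.children = copy.deepcopy(initial_root).children


def frontier(node):
    """Labels of the leaves from left to right."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in frontier(child)]


tree = Node("expr", [Node("X")])                                    # initial tree
beta = Node("expr", [Node("expr", foot=True), Node("op"), Node("expr")])
adjoin(tree, beta)
substitute(tree.children[1], Node("op", [Node("+")]))
substitute(tree.children[2], Node("expr", [Node("X")]))
print(" ".join(frontier(tree)))
```

Both operations work on copies of the blueprint trees, so the same auxiliary or initial tree can be applied any number of times.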

Like the other grammar-guided methods, the TAG3P approach uses a genotype- 
phenotype mapping. The phenotypes are, of course, trees that comply with the input tree- 
adjoining grammar. The genotypes being evolved are derivation trees that work on this 
grammar too. Derivation trees illustrate the way the productions of a grammar are applied 
in order to derive a certain sentence, as discussed in Section 30.3.3 on page 563. 

Derivation Trees 

For tree-adjoining grammars, there exist different types of derivation trees [1525]. In the 
method of Weir [2174], they are characterized as object trees where the root is labeled 
with an S-type initial tree (i. e., the start symbol) and all other trees are labeled with 
the names of auxiliary trees. Each connection from a parent p to a child node c is labeled 
with the index of the node in p being the center of the operation. Indices are determined 
by numbering the non-terminal nodes according to a preorder traversal 34. The number of
adjunctions performed with each node is limited to one. Substitution operations are not 
possible with Weir's approach. Joshi and Schabes [1074] introduce an extension mitigating 
this problem. In their notation (not illustrated here) a solid connection between two nodes 
in the derivation tree stands for adjunction, whereas a broken line denotes a substitution. 

In TAG3P, Nguyen [1525] uses a restricted form of such TAG derivation trees where
adjunction is not permitted to (initial) trees used for substitution. This essentially means that
all adjunctions are performed before any substitutions. With this definition, substitutions
become basically in-node operations. We simply attach the nodes substituted into a tree as
list of lexemes (here terminal symbols) to the according node of a derivation tree.

Example Mapping: Derivation Tree → Tree

Figure 4.20 outlines some example mappings from derivation trees on the left side to
sentences of the target languages (displayed as trees) on the right side. In Figure 4.19, we have
annotated some of the elementary trees with α or β and numbers which we will use here. The
derivation tree α₁, for example, represents the initial production for the starting symbol.
In addition, we have attached the preorder index to each node of the trees α₁, β₃, and β₅.
In the next tree we show how the terminal symbols X and + can be substituted into β₃. In
the corresponding derivation tree, they are simply attached as a list of lexemes. A similar
substitution can be performed with β₅, where sin is attached as terminal symbol.

In the fourth example, the second derivation tree is adjoined to the first one. Since it
replaces the node with the preorder index 1, the connection from β₃ to α₁ is labeled with
1. Finally, in the fifth example, the third derivation tree is adjoined. We use the rule for
preops to replace the node number 3 (according to preorder) in the second derivation in its
adjoined state.

As you can see, all initial trees as well as all trees derived from them are always valid
sentences of the grammar. This means that we can remove any of the derivation steps and
still get valid phenotypes. Thus, we can evaluate the share of the fitness contributed by every
single modification by evaluating the resulting phenotypes with and without it.



34 http://en.wikipedia.org/wiki/Tree_traversal [accessed 2007-07-18]

Figure 4.20: One example genotype-phenotype mapping in TAG3P.

Summary

Tree-Adjoining Grammar-guided Genetic Programming is a different approach to
Grammar-guided Genetic Programming which has some advantages compared with the other methods.
One of them is the increased domain of locality. All nodes of a derivation tree stay accessible
for the reproduction operations. This becomes interesting when modifying nodes "without
side effects to other regions of the resulting trees". If we, for example, toggle one bit in
a Grammatical Evolution-based genotype, chances are that the meaning of all subsequent
genes changes and the tree resulting from the genotype-phenotype mapping will be totally
different from its parent. In TAG3P, this is not the case. All operations can, at most, influence
the node they are applied to and its children. Here, the principle of strong causality holds
since small changes in the genotype lead to small changes in the phenotype. On the other
hand, some of these positive effects may also be reached more easily with the wrapping
and lifting operations for Genetic Programming introduced in this book in Section 4.3.7 on
page 166 and Section 4.3.8. The reproduction operations of TAG3P become a little bit more
complicated. When performing crossover, for instance, we can only exchange compatible
nodes. We cannot adjoin the tree α₁ in Figure 4.20 with itself, for example.



General Information 

Areas Of Application 

Some example areas of application of Tree-Adjoining Grammar-guided Genetic
Programming are:

Application References 

Symbolic Regression and Function Synthesis [1529, 1527, 1528] 

Mathematical Problems [1524, 1531] 



Online Resources 



Some general, online available resources on Tree-Adjoining Grammar-guided Genetic
Programming are:

http://sc.snu.ac.kr/SCLAB/Research/publications.html [accessed 2007-09-10]
Last update: up-to-date
Description: Publications of the Structural Complexity Laboratory of the Seoul National
University, includes Nguyen's papers about TAG3P



4.6 Linear Genetic Programming 
4.6.1 Introduction 

In the beginning of this chapter, we have learned that the major goal of Genetic Programming 
is to find programs that solve a given set of problems. We have seen that tree genomes are 
suitable to encode such programs and how the genetic operators can be applied to them. 

Nevertheless, we have also seen that trees are not the only way of representing programs.
As a matter of fact, a computer processes them as sequences of instructions instead. These
sequences may contain branches in the form of jumps to other places in the code. Every possible
flowchart describing the behavior of a program can be translated into such a sequence. It is
therefore only natural that the first approach to automated program generation, developed
by Friedberg [750] at the end of the 1950s, used a fixed-length instruction sequence genome
[750, 751]. The area of Genetic Programming focused on such instruction string genomes is
called linear Genetic Programming (LGP).

Linear Genetic Programming can be distinguished from approaches like Grammatical 
Evolution (see Section 4.5.6 on page 181) by the fact that strings there are just genotypic, 
intermediate representations that encode the program trees. In LGP, they are the center 
of the whole evolution and contain the program code directly. Some of the most important 
early contributions to this field come from [1667]: 

1. Banzhaf [135], who used a genotype-phenotype mapping with repair mechanisms to 
translate a bit string into a sequence of simple arithmetic instructions in 1993, 

2. Perkis [1636] (1994), whose stack based GP evaluated arithmetic expressions in Reverse 
Polish Notation (RPN), 

3. Openshaw and Turton [1582] (1994) who also used Perkis's approach but already repre- 
sented mathematical equations as fixed-length bit string back in the 1980s [1581], and 

4. Crepeau [464], who developed a machine code GP system around an emulator for the 
Z80 processor. 

Besides the methods discussed in this section, other interesting approaches to linear Genetic
Programming are the LGP variants developed by Eklund [627] and Leung et al. [1273, 380]
on specialized hardware, the commercial system by Foster [736], and the MicroGP (μGP)
system for test program induction by Corno et al. [451, 1949].

4.6.2 Advantages and Disadvantages 

The advantage of linear Genetic Programming lies in the straightforward evaluation of the 
evolved algorithms. Its structure furthermore eases limiting the runtime in the program 
evaluation and even simulating parallelism. The drawback is that simply reusing the genetic 
operators for variable-length string genomes (discussed in Section 3.5 on page 149), which 
randomly insert, delete, or toggle bits, is not really feasible. In LGP forms that allow
arbitrary jumps and call instructions to shape the control flow, this becomes even more
pronounced because of a high degree of epistasis (see Section 1.4.6 and Section 4.8).

We can visualize, for example, that the alternatives and loops which we know from
high-level programming languages are mapped to conditional and unconditional jump
instructions in machine code. These jumps target either absolute or relative addresses inside
the program. Let us consider the insertion of a single, new command into the instruction
string, maybe as result of a mutation or recombination operation. If we do not perform
any further corrections after this insertion, it is well possible that the resulting shift of the
absolute addresses of the subsequent instructions in the program invalidates the control flow
and renders the whole program useless. This issue is illustrated in Fig. 4.21.a. Nordin et al.
[1546, 1546] point out that standard crossover is highly disruptive. Even though the subtree
crossover in tree genomes is shown to be not very efficient either [62], in comparison,
tree-based genomes are less vulnerable in this aspect. The loop in Fig. 4.21.b, for instance,
stays intact although it is now one useless instruction richer. In LGP, precautions have to be
taken in order to mitigate these problems. If this is done, linear Genetic Programming becomes
more competitive with Standard Genetic Programming, also in terms of the robustness of the
recombination operations.

One approach to do so is to create intelligent mutation and crossover operators which
preserve the control flow of the program when inserting or deleting instructions. Such
operations could, for instance, analyze the program structure and automatically correct jump
targets. Operations which are restricted to have only minimal effect on the
control flow from the start can also easily be introduced. In Section 4.6.6, we briefly outline
some of the work of Brameier and Banzhaf, who define some interesting approaches to this
issue. Section 4.6.7 discusses the homologous crossover operation which represents another
method for decreasing the destructive effects of reproduction in LGP.

Fig. 4.21.a: Inserting into an instruction string.

Fig. 4.21.b: Inserting in a tree representation.

Figure 4.21: The impact of insertion operations in Genetic Programming
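
The effect of an insertion on absolute jump addresses, and an insertion operator that corrects the jump targets, can be sketched in Python. The instruction format (tuples with the opcode first and, for jumps, the absolute target address second) and the names JUMPS, insert_naive, and insert_fixing_jumps are assumptions made for this illustration.

```python
JUMPS = {"jmp", "jz", "jnz"}     # opcodes whose first operand is an absolute address


def insert_naive(program, index, instr):
    """Insert instr at index without touching any jump target."""
    return program[:index] + [instr] + program[index:]


def insert_fixing_jumps(program, index, instr):
    """Insert instr at index and shift all jump targets at or behind index."""
    fixed = []
    for ins in program:
        if ins[0] in JUMPS and ins[1] >= index:
            ins = (ins[0], ins[1] + 1)    # the old target moved down by one slot
        fixed.append(ins)
    return fixed[:index] + [instr] + fixed[index:]


# A small loop: the instruction at address 2 jumps back to address 1.
program = [("load",), ("dec",), ("jnz", 1), ("halt",)]

naive = insert_naive(program, 0, ("add",))
fixed = insert_fixing_jumps(program, 0, ("add",))
print(naive[naive[3][1]])    # the jump now hits ('load',): the loop is changed
print(fixed[fixed[3][1]])    # the jump still hits ('dec',): the loop is intact
```

A deletion operator could correct targets in the same manner by decrementing all addresses behind the removed instruction. With relative addresses, only the jumps that span the modified position need to be adapted.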



4.6.3 The Compiling Genetic Programming System 

The roots of this system go back to Nordin [1541], who was dissatisfied with the performance
of GP systems written in an interpreted language which, in turn, interpret the programs
evolved using a tree-shaped genome. In 1994, he published his work on a new Compiling
Genetic Programming System (CGPS) written in the C programming language 35 [1126] directly
manipulating individuals represented as machine code.

Each solution candidate consisted of a prologue for shoveling the input from the stack into 
registers, a set of instructions for information processing, and an epilogue for terminating the 
function [1542]. The prologue and epilogue were never modified by the genetic operations. 
As instructions for the middle part, the Genetic Programming system had arithmetical 
operations and bit-shift operators at its disposal in [1541], but no control flow manipulation 
primitives like jumps or procedure calls. These were added in [1543] along with ADFs, 
making this LGP approach Turing-complete. 

Nordin [1541] used the classification of Swedish words as task in the first experiments 
with this new system. He found that it had approximately the same capability for grow- 
ing classifiers as artificial neural networks but performed much faster. Another interesting 
application of his system was the compression of images and audio data [1545]. 



4.6.4 Automatic Induction of Machine Code by Genetic Programming 

CGPS originally evolved code for the Sun SPARC processors, which are members of the
RISC 36 processor class. This had the advantage that all instructions have the same
size. In the Automatic Induction of Machine Code with GP system (AIM-GP, AIMGP),
the successor of CGPS, support for multiple other architectures was added by Nordin,
Banzhaf, and Francone [1549, 1550], including Java bytecode 37 and CISC 38 CPUs with
variable instruction widths such as Intel 80x86 processors. A new interesting application for
linear Genetic Programming tackled with AIMGP is the evolution of robot behavior such
as obstacle avoiding and wall following [1548].

35 http://en.wikipedia.org/wiki/C_(programming_language) [accessed 2008-09-16]

36 http://de.wikipedia.org/wiki/Reduced_Instruction_Set_Computing [accessed 2008-09-113]

37 http://en.wikipedia.org/wiki/Bytecode [accessed 2008-09-16]

38 http://de.wikipedia.org/wiki/Complex_Instruction_Set_Computing [accessed 2008-09-16]

#include <math.h>

void ind(double v[8]) {
  v[0] = v[5] + 73;
  v[7] = v[0] - 59;         // (I)
  if(v[1] > 0)
    if(v[5] > 23)
      v[4] = v[2] * v[1];
  v[2] = v[5] + v[4];       // (I)
  v[6] = v[0] * 25;         // (I)
  v[6] = v[4] - 4;
  v[1] = sin(v[6]);
  if(v[0] > v[1])           // (I)
    v[3] = v[5] * v[5];     // (I)
  v[7] = v[6] * 2;
  v[5] = v[7] + 115;        // (I)
  if(v[1] <= v[6])
    v[1] = sin(v[7]);
}

Listing 4.8: A genotype of an individual in Brameier and Banzhaf's LGP system.

4.6.5 Java Bytecode Evolution 

Besides AIMGP, there exist numerous other approaches to the evolution of linear Java
bytecode functions. The Java Bytecode Genetic Programming system (JBGP, also Java
Method Evolver, JME) by Lukschandl et al. [1328, 1329, 1330, 1331] is written in Java.
A genotype in JBGP contains the maximum allowed stack depth together with a linear
list of instruction descriptors. Each instruction descriptor holds information such as the
corresponding bytecode and the branch offset. The genotypes are transformed with the
genotype-phenotype mapping into methods of a Java class which then can be loaded into
the JVM, executed, and evaluated [903, 902].

In the JAPHET system of Klahold et al. [1147], the user provides an initial Java class at
startup. Classes are divided into a static and a dynamic part. The static parts contain things
like version information and are not affected by the reproduction operations. The dynamic
parts, containing the methods, are modified by the genetic operations which add new byte code
[903, 902].

Harvey et al. [903, 902] introduce byte code GP (bcGP), where the whole population of
each generation is represented by one class file. As in AIMGP, each individual is a linear
sequence of Java bytecode and is surrounded by a prologue and epilogue. Furthermore, by
adding buffer space, each individual has the same size and, thus, the whole population can
be kept inside a byte array of a fixed size, too.

4.6.6 Brameier and Banzhaf: LGP with Implicit Intron removal

In the Genetic Programming system developed by Brameier and Banzhaf [272] based on
former experience with AIMGP, an individual is represented as a linear sequence of simple
C instructions as outlined in the example Listing 4.8 (a slightly modified version of the
example from [272]). Due to reproduction operations such as mutation and crossover, such
genotypes may contain introns, i.e., instructions not influencing the result (see Definition 3.2
and Section 4.10.3). Given that the program defined in Listing 4.8 stores
its outputs in v[0] and v[1], all the lines marked with (I) do not contribute to the overall
functional fitness. Brameier and Banzhaf [272] introduce an algorithm which removes these
introns during the genotype-phenotype mapping, before the fitness evaluation. This linear
Genetic Programming method was successfully tested with several classification tasks [272,
271, 273], function approximation and Boolean function synthesis [274].
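
The idea of detecting such introns can be illustrated with a backward pass over the instruction sequence. The following Python sketch is a simplification and not the published algorithm of Brameier and Banzhaf: it only handles straight-line code without branches, and the instruction format (destination register, tuple of source registers) as well as the name effective_instructions are assumptions. The example program consists of the unconditional assignments of Listing 4.8 only.

```python
def effective_instructions(program, outputs):
    """Return the indices of the instructions that influence the output registers.

    program is a list of (dest, sources) pairs; outputs is an iterable holding
    the registers that contain the result after the program has been executed.
    """
    needed = set(outputs)          # registers whose current value still matters
    keep = []
    for i in range(len(program) - 1, -1, -1):     # walk backwards
        dest, sources = program[i]
        if dest in needed:         # this assignment feeds a needed register
            keep.append(i)
            needed.discard(dest)   # earlier writes to dest are overwritten here
            needed.update(sources)
    keep.reverse()
    return keep


program = [
    (0, (5,)),   # 0: v[0] = v[5] + 73
    (7, (0,)),   # 1: v[7] = v[0] - 59
    (6, (0,)),   # 2: v[6] = v[0] * 25
    (6, (4,)),   # 3: v[6] = v[4] - 4
    (1, (6,)),   # 4: v[1] = sin(v[6])
    (7, (6,)),   # 5: v[7] = v[6] * 2
    (5, (7,)),   # 6: v[5] = v[7] + 115
]
print(effective_instructions(program, outputs={0, 1}))
```

All instructions whose index is not returned are introns with respect to the outputs v[0] and v[1]. Conditional instructions as in Listing 4.8 require additional care, since a register written inside a branch may still keep its old value.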

In his doctoral dissertation, Brameier [269] elaborates that the control flow of linear
Genetic Programming resembles a graph more than a tree because of jump and call instructions.
In the earlier work of Brameier and Banzhaf [272] mentioned just a few lines ago, introns were
only excluded by the genotype-phenotype mapping but preserved in the genotypes because
they were expected to make the programs robust against variations. In [269], Brameier
concludes that such implicit introns representing unreachable or ineffective code have no real
protective effect but reduce the efficiency of the reproduction operations and, thus, should
be avoided or at least minimized by them. Instead, the concept of explicitly defined introns
(EDIs) proposed by Nordin et al. [1547] is utilized in the form of something like nop
instructions in order to decrease the destructive effect of crossover. Brameier finds that
introducing EDIs decreases the proportion of introns arising from unreachable or ineffective
code and leads to better results. In comparison with standard tree-based GP, his linear Genetic
Programming approach performed better during experiments with classification, regression, and
Boolean function evolution benchmarks.

4.6.7 Homologous Crossover: Binary Reproduction 

According to Banzhaf et al. [140], natural crossover is very restricted and usually exchanges 
only genes that express the same functionality and are located at the same positions (loci) 
on the chromosomes. 

Definition 4.3 (Homology). In genetics, homology 39 of protein-coding DNA sequences 
means that they code for the same protein which may indicate common functionality. Ho- 
mologous chromosomes 40 are either chromosomes in a biological cell that pair during meiosis 
or non-identical chromosomes which code for the same functional feature by containing sim- 
ilar genes in different allelic states. 

In other words, homologous genetic material is very similar and in nature, only such 
material is exchanged in sexual reproduction. In linear Genetic Programming with default 
crossover, it is hard for the evolution to establish a clear structure or a map between locus 
and functionality. Francone et al. [740, 1549] introduce a sticky crossover operator which 
resembles homology by allowing the exchange of instructions between two genotypes (pro- 
grams) only if they reside at the same loci. It first chooses a sequence of code in the first 
genotype and then swaps it with the sequence at exactly the same position in the second 
parent. 
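
A minimal Python sketch of such a position-preserving exchange is given below. The function name sticky_crossover, the representation of programs as plain lists, and the way the segment is drawn are assumptions for illustration only and are not taken from Francone et al.

```python
import random


def sticky_crossover(parent_a, parent_b, rng=random):
    """Swap one instruction segment that occupies the same loci in both parents."""
    limit = min(len(parent_a), len(parent_b))   # loci present in both programs
    start = rng.randrange(limit)                # first locus of the segment
    end = rng.randrange(start + 1, limit + 1)   # one past the last locus
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b


# Hypothetical genotypes: every instruction stays at the locus it came from.
a = list("ABCDEFG")
b = list("abcde")
child_a, child_b = sticky_crossover(a, b, random.Random(1))
```

Because the segment keeps its position, the offspring have the same lengths as their parents, and an instruction at locus i can only be replaced by the instruction found at locus i in the other parent.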

4.6.8 Page-based LGP 

A similar approach is the Page-based linear Genetic Programming of Heywood and Zincir- 
Heywood [923], where programs are described as sequences of pages, each including the same 
number of instructions. Here, crossover exchanges only a single page between the parents 
and, as a result, becomes less destructive. This approach should be distinguished from the 
fixed block size approach of Nordin et al. [1549] for CISC architectures which was developed 
to accommodate variable instruction lengths in AIMGP. 

4.7 Graph-based Approaches 

In this section, we will discuss some Genetic Programming approaches that are based on 
graphs rather than on trees or linear sequences of instructions. 

39 http://en.wikipedia.org/wiki/Homology_(biology) [accessed 2008-06-17] 

40 http://en.wikipedia.org/wiki/Homologous_chromosome [accessed 2008-06-17] 
4.7.1 Parallel Algorithm Discovery and Orchestration 

Parallel Algorithm Discovery and Orchestration (PADO) is a Genetic Programming method 
introduced by Teller and Veloso [2011, 2015] in the mid-1990s. In their CEC paper [2014] 
and their 1996 book chapter [2016], they describe the graph-structure of their approach as 
sketched in Figure 4.22. A PADO program is a directed graph of up to n nodes, where 




Figure 4.22: The general structure of a Parallel Algorithm Discovery and Orchestration 
program. 



each node may have as many as n outgoing arcs which define the possible control flows. A 
node consists of two parts: an action and a branching decision. The programs use indexed 
memory and an implicitly accessed stack. The actions pop their inputs from the stack and 
place their outputs onto it. After a node's action has been executed, the branching decision 
function is used to determine over which of the outgoing arcs the control will be transferred. 
It can access the stack, the memory, and the action type of the previous node in order to 
make that decision. 

A program in the PADO-syntax has a start node which will initially receive the control 
token and an end node which terminates the program after its attached action has been 
performed. Furthermore, the actions may call functions from a library and automatically 
defined functions (ADFs). These ADFs basically have the same structure as the main program 
and can also invoke themselves recursively. 

As actions, PADO provides algebraic primitives like +, -, *, /, NOT, MAX, and MIN; the 
memory instructions READ and WRITE; branching primitives like IF-THEN-ELSE and PIFTE (an 
alternative with a randomized condition); as well as constants and some domain-specific 
instructions. Furthermore, actions may invoke more complex library functions or ADFs. An 
action takes its arguments from the stack and also pushes its results back onto it. The action 
6, for instance, pushes 6 on the stack whereas the WRITE action pops two values, v_1 and v_2, 
from it and pushes the value of the memory cell indexed by v_2 before storing v_1 at this 
location. 
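
The stack discipline of such actions can be sketched in a few lines of Python. This is only an illustration under assumptions: the function run_actions is hypothetical, it executes a plain list of actions without any branching decisions, and the order in which WRITE pops its two arguments (value first, then index) is assumed rather than taken from the PADO literature.

```python
def run_actions(actions, memory):
    """Execute a list of PADO-like actions on an implicit stack."""
    stack = []
    for action in actions:
        if isinstance(action, int):        # a constant action such as 6
            stack.append(action)
        elif action == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif action == "READ":             # replace an index by the cell content
            stack.append(memory[stack.pop()])
        elif action == "WRITE":
            value = stack.pop()            # assumed: the value to store
            index = stack.pop()            # assumed: the index of the cell
            stack.append(memory[index])    # push the old content first ...
            memory[index] = value          # ... then overwrite the cell
        else:
            raise ValueError(f"unknown action: {action!r}")
    return stack


memory = [10, 20, 30]
stack = run_actions([1, 6, "WRITE"], memory)   # store 6 in cell 1
```

After this run, the stack holds the old content of cell 1, and the cell itself contains 6.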

In PADO, so-called SMART operators are used for mutation and recombination which 
co-evolve with the main population as described in [2013, 2016]. 

4.7.2 Parallel Distributed Genetic Programming 

Parallel Distributed Genetic Programming (PDGP) is a method for growing programs in 
the form of graphs that has been developed by Poli [1654, 1655, 1658, 1659] in the mid 
1990s. In PDGP, a graph is represented as a fixed-size, n-dimensional grid. The nodes of the 
grid are labeled with operations, functions, or references to variables. Except for the latter 
case, they are connected to their inputs with directed links. Both the labels and the 
connections in the grid are subject to evolution. 
Figure 4.23: The term max {x * y, x * y + 3} 



In order to illustrate this structure, we use the term max {x * y, x * y + 3} as example. 
We have already discussed in detail how we can express mathematical terms as trees. 
Fig. 4.23.a illustrates such a function tree. Using a directed graph, as outlined in Fig. 4.23.b, 
we can retrieve a more compact representation of the same term by reusing the expression 
x * y. Evolving such graphs is the goal of PDGP. Therefore, a grid structure needs to be 
defined first. In Fig. 4.23.c, we have settled for a two-dimensional 4*3 grid. Additionally, we 
add a row at the top containing one cell for each output of the program. We can easily fill the 
graph from Fig. 4.23.b into this grid. This leaves some nodes unoccupied. If we assume that 
Fig. 4.23.c represents a solution grown by this Genetic Programming approach, these nodes 
would be labeled with some unused expressions and would somehow be connected without 
any influence on the result of the program. Such an arbitrary configuration of inactive nodes 
(or introns) and links is sketched in light gray in Fig. 4.23.c. The nodes which have influence 
on the result of the program, i.e., those which are connected to an output node directly or 
indirectly, are called active nodes. 
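
The distinction between active and inactive nodes can be made concrete with a small Python sketch. The dictionary-based grid encoding, the (row, column) keys, and the helper names active_nodes and evaluate are assumptions chosen for illustration; the layout only loosely follows the example term and is not a transcription of the figure.

```python
OPS = {
    "max": max,
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "id": lambda a: a,          # pass-through node
}


def active_nodes(nodes, outputs):
    """Collect every node that is directly or indirectly connected to an output."""
    active, stack = set(), list(outputs)
    while stack:
        node = stack.pop()
        if node not in active:
            active.add(node)
            stack.extend(nodes[node][1])
    return active


def evaluate(nodes, node, env, cache):
    """Evaluate a node; shared sub-graphs such as x * y are computed only once."""
    if node not in cache:
        label, inputs = nodes[node]
        if label in OPS:
            args = [evaluate(nodes, i, env, cache) for i in inputs]
            cache[node] = OPS[label](*args)
        elif label in env:
            cache[node] = env[label]      # variable
        else:
            cache[node] = float(label)    # constant
    return cache[node]


# Keys are (row, column); each value is (label, list of input nodes).
grid = {
    (0, 0): ("max", [(1, 0), (1, 1)]),    # output row
    (1, 0): ("id", [(2, 0)]),
    (1, 1): ("+", [(2, 0), (2, 1)]),
    (2, 0): ("*", [(3, 0), (3, 1)]),      # x * y, used twice
    (2, 1): ("3", []),
    (2, 2): ("-", [(3, 1), (3, 2)]),      # inactive
    (3, 0): ("x", []),
    (3, 1): ("y", []),
    (3, 2): ("x", []),                    # inactive
}
active = active_nodes(grid, [(0, 0)])
value = evaluate(grid, (0, 0), {"x": 2, "y": 5}, {})
```

Changing the labels or links of the two inactive nodes would leave the computed value untouched, which is exactly what makes them introns.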

We may impose restrictions on the connectivity of PDGP graphs. For instance, we can 
define that each row must only be connected to its predecessor in order to build layered 
feed-forward networks. We can transform any given parallel distributed program (i.e., any 
given acyclic graph) into such a layered network if we additionally provide the identity 
function so pass-through nodes can evolve as shown in Fig. 4.23.c. Furthermore, we could 
also attach weights to the links between the nodes and make them also subject to evolution. 
This way, we can also grow artificial neural networks [1657]. However, we can as well do 
without any form of restrictions for the connectivity and may allow backward connections 
in the programs, depending on the application. 

An interesting part of PDGP is how the programs are executed. In principle, it allows 
for a great proportion of parallelism. Coming back to the example outlined in Fig. 4.23.c, the 
values of the leaf nodes could be computed in parallel, as well as those of the pass-through and 
the addition node. 



Genetic Operations 

For this new program representation, novel genetic operations are needed. 
Creation 

Similar to the grow and full methods for creating trees in Standard Genetic Programming 
introduced in Section 4.3.1 on page 162, it is possible to obtain balanced or unbalanced 
graphs/trees in PDGP, depending on whether we allow variables and constants to occur anywhere 
in the program or only at a given, predetermined depth. 

Crossover: Binary Reproduction 

SAAN Crossover The basic recombination operation in PDGP is Sub-graph Active-Active 
Node (SAAN) crossover. The idea of SAAN crossover is that active sub-graphs represent 
functional units which should be combined in different ways in order to explore new, useful 
constellations. It proceeds as follows: 

1. Select a random active node in each parent, the crossover points. 

2. Extract the sub-graph that contains all the (active) nodes that influence the result of 
the node marking the crossover point in the first parent. 

3. Insert this sub-graph at the crossover point in the second parent. If its x-coordinate is 
incompatible and some nodes of the sub-graph would be outside the grid, wrap it so 
that these nodes are placed on the other side of the offspring. 

Of course, we have to ensure that the depths of the crossover points are compatible 
and no nodes of the sub-graph would "hang" below the grid in the offspring. This can 
be achieved by first selecting the crossover point in the first parent and then choosing a 
compatible crossover point in the second parent. 

SSAAN Crossover The Sub-Sub-Graph Active-Active Node (SSAAN) Crossover method 
works essentially the same way, with one exception: it disregards crossover point depth 
compatibility. It may now happen that we want to insert a sub-graph into an offspring at 
a point where it does not fit because it is too long. Here we make use of the simple fact 
that the lowest row in a PDGP graph always is filled with variables and constants only - 
functions cannot occur there because otherwise, no arguments could be connected to them. 
Hence, we can cut the overhanging nodes of the sub-graph and connect the now unsatisfied 
arguments in the second-to-last row with the nodes in the last row of the second parent. Of 
course, we have to pay special attention where to cut the sub-graph: terminal nodes that 
would be copied to the last row of the offspring can remain in it, functions cannot. 

SSIAN Crossover The Sub-Sub-Graph Inactive-Active Node (SSIAN) Crossover works exactly like SSAAN 
crossover except that the crossover point in the first parent is chosen amongst both active 
and inactive nodes. 

Mutation: Unary Reproduction 

We can extend the mutation operation from Standard Genetic Programming easily to PDGP 
by creating new, random graphs and inserting them at random points into the offspring. In the 
context of PDGP, this is called global mutation and can be achieved by creating a completely 
new graph and performing crossover with an existing one. 

Furthermore, link mutation is introduced as an operation that performs simple local 
changes on the connection topology of the graphs. 

ADLs 

Similar to Standard Genetic Programming, we can also introduce automatically defined 
functions 41 in PDGP by extending the function set with a new symbol which then executes 
an (also evolved) subprogram when being evaluated. Automatically Defined Links, ADLs, 
work similarly, except that a link is annotated with the subprogram-invoking symbol [1653, 
1656]. 



41 More information on ADFs can be found in Section 4.3.9 on page 167. 
4.7.3 Genetic Network Programming 

Genetic Network Programming (GNP) is a Genetic Programming technique introduced by 
Katagiri et al. [1095] at the 2001 CEC conference in Seoul [1095, 1096, 1097, 931]. In GNP, 
programs are represented as directed graphs called networks which consist of three types of 
nodes: the start node, judgment nodes and processing nodes. A processing node executes 
an action from a predefined set of actions P and can have exactly one outgoing connection 
to a successor node. Judgment nodes may have multiple outgoing connections and have one 
expression from the set of possible judgment decisions J attached to them with which they 
make this decision. As in the example illustrated in Figure 4.24, each node in the network is 




represented by two genes, a node gene and a connection gene. The node gene consists of two 
values, the node type (which is 0 for processing nodes and 1 for judgment nodes) and the 
function index. For processing nodes, the function index can take on the values from 0 to 
|P| - 1 and for judgment nodes, it is in 0..|J| - 1. These values identify the action or decision 
function to be executed whenever the node receives the control token. In the connection gene, 
the indices of the other nodes the node is connected to are stored. For processing nodes, 
this list has exactly one entry; for judgment nodes, there always are at least two outgoing 
connections (in Figure 4.24, there are exactly two). Notice that programs can be interpreted 
in this representation directly without needing an explicit genotype-phenotype mapping. 
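
How such a network can be interpreted directly is sketched below in Python. The tuple layout ((node type, function index), connections), the fixed step budget, the passing of the start node as an index, and the small counter example are assumptions for illustration; they are not taken from the GNP papers.

```python
PROCESSING, JUDGMENT = 0, 1


def run_gnp(genome, start, actions, judgments, state, steps):
    """Pass the control token through the network for a fixed number of steps."""
    current = start
    for _ in range(steps):
        (node_type, function_index), connections = genome[current]
        if node_type == PROCESSING:
            actions[function_index](state)         # execute an action from P
            current = connections[0]               # exactly one successor
        else:
            branch = judgments[function_index](state)
            current = connections[branch]          # a decision from J picks the arc
    return state


# Hypothetical action set P and judgment set J.
def increment(state):
    state["x"] += 1


def reset(state):
    state["x"] = 0
    state["resets"] += 1


def below_three(state):
    return 0 if state["x"] < 3 else 1


genome = [
    ((JUDGMENT, 0), [1, 2]),     # node 0: judge whether x < 3
    ((PROCESSING, 0), [0]),      # node 1: increment, then back to node 0
    ((PROCESSING, 1), [0]),      # node 2: reset, then back to node 0
]
final_state = run_gnp(genome, 0, [increment, reset], [below_three],
                      {"x": 0, "resets": 0}, steps=10)
```

The interpreter works on the gene values themselves, which is why no separate genotype-phenotype mapping is required in this representation.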

Crossover is performed by randomly exchanging nodes (and their attached connections) 
between the parent networks and mutation randomly changes the connections. Murata and 
Nakamura [1492] extended this approach in order to evolve programs for multi-agent systems 
where the behavior of agents depends on the group they are assigned to. In 
this Automatically Defined Groups (ADG) model, an individual is defined as a set of GNP 
programs [1491, 1490]. Genetic Network Programming has also been combined with reinforcement 
learning by Mabu et al. [1337]. 

4.7.4 Cartesian Genetic Programming 

Cartesian Genetic Programming (CGP) was developed by Miller and Thomson [1421] in 
order to achieve a higher degree of effectiveness in learning Boolean functions [1418, 1422]. 
In his 1999 paper, Miller [1418] explains the idea of Cartesian Genetic Programming with the 
example of a program with o = 2 outputs that computes both the difference and the sum 
of the volumes of two boxes V_1 = X_1 X_2 X_3 and V_2 = Y_1 Y_2 Y_3. As illustrated in Figure 4.25, 
the i = 6 input variables X_1 ... X_3 and Y_1 ... Y_3, placed to the left, are numbered from 0 
Figure 4.25: An example for the GPM in Cartesian Genetic Programming. 



to 5. As function set, we use {+ = 0, - = 1, * = 2, / = 3, ∨ = 4, ∧ = 5, ⊕ = 6, ¬ = 7}. Like 
in PDGP, we define a grid of cells before the evolution begins. In our example, this grid 
is n = 3 cells wide and m = 2 cells deep. Each of the cells can accommodate an arbitrary 
function and has a fixed number of inputs and outputs (in the example i' = 2 and o' = 1, 
respectively). The outputs of the cells, similarly to the inputs of the program, are numbered 
in ascending order beginning with i. The output of the cell in the top-left has number 6, 
the one of the cell below number 7, and so on. This numeration is annotated in gray in 
Figure 4.25. 

Which functions the cells should carry out and how their inputs and outputs are connected 
will be decided by the optimization algorithm. For this purpose, we could use, for instance, 
a genetic algorithm with or without crossover or a hill climbing approach. The genotypes of 
Cartesian Genetic Programming are fixed-length integer strings. They consist of n*m genes, 
each encoding the configuration of one cell. Such a gene starts with i' numbers identifying 
the incoming data and one number (underlined in Figure 4.25) denoting the function it will 
carry out. Another gene at the end of the genotype identifies which of the available data are 
"wired" to the outputs of the program. 

By using a fixed-length genotype, the maximum number of expressions in a Cartesian 
program is also predefined. It may, however, be shorter, since not all internal cells are 
necessarily connected with the output-producing cells. Furthermore, not all functions need 
to incorporate all i' inputs into their results. The function ¬, which is also part of the example 
function set, for instance, uses only the first of its i' = 2 input arguments and ignores the second one. 

Levels-back, a parameter of CGP, is the number of columns to the left of a given cell 
whose outputs may be used as inputs of this cell. If levels-back is one, the cell with the 
output 8 in the example could only use 6 or 7 as inputs. A levels-back value of 2 allows 
it to be connected with 0-5. Of course, the reproduction operations have to respect the 
levels-back value set. 
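
The following Python sketch decodes and runs a Cartesian genotype for the box-volume example and computes which data a cell may read for a given levels-back value. The concrete genotype is a hypothetical one written for this illustration and is not read off Figure 4.25; only the four arithmetic functions are included, and the protected division is an assumption.

```python
FUNCTIONS = {
    0: lambda a, b: a + b,
    1: lambda a, b: a - b,
    2: lambda a, b: a * b,
    3: lambda a, b: a / b if b != 0 else 1.0,   # assumed protected division
}


def decode_and_run(genes, output_gene, inputs):
    """Evaluate all cells in numbering order and return the wired outputs."""
    values = list(inputs)                        # data items 0 .. i-1
    for in1, in2, func in genes:                 # cell k yields data item i + k
        values.append(FUNCTIONS[func](values[in1], values[in2]))
    return [values[k] for k in output_gene]


def allowed_sources(cell_number, num_inputs, rows, levels_back):
    """Data items a cell may read; the program inputs count as one column."""
    column = (cell_number - num_inputs) // rows
    first_column = column - levels_back
    sources = []
    if first_column < 0:
        sources.extend(range(num_inputs))
    for col in range(max(first_column, 0), column):
        start = num_inputs + col * rows
        sources.extend(range(start, start + rows))
    return sources


# Inputs X_1, X_2, X_3, Y_1, Y_2, Y_3 are data items 0 to 5; n = 3, m = 2.
genes = [
    (0, 1, 2),   # cell 6:  X_1 * X_2
    (3, 4, 2),   # cell 7:  Y_1 * Y_2
    (6, 2, 2),   # cell 8:  V_1 = cell 6 * X_3
    (7, 5, 2),   # cell 9:  V_2 = cell 7 * Y_3
    (8, 9, 1),   # cell 10: V_1 - V_2
    (8, 9, 0),   # cell 11: V_1 + V_2
]
difference, total = decode_and_run(genes, (10, 11), (2, 3, 4, 1, 2, 3))
```

Note that this particular genotype needs a levels-back value of at least 2, since cells 8 and 9 read the program inputs 2 and 5 directly; a reproduction operation could use allowed_sources to draw only legal connection values.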

CGP labeled itself a Genetic Programming technique from the beginning. However, most 
of the work contributed about it did not consider a recombination operation. Hence, one 
could also regard it as an evolutionary programming 42 method. Lately, researchers have also 
begun to focus on efficient crossover techniques for CGP [414]. 



Neutrality in CGP 

Cartesian Genetic Programming explicitly utilizes different forms of neutrality 43 in order to 
foster the evolutionary progress. Normally, neutrality can have positive as well as negative 
effects on the evolvability of a system. Yu and Miller [2297, 2296] outline different forms 
of neutrality in Cartesian Genetic Programming which also apply to other forms of GP or 
GAs: 

1. Inactive genes define cells that are not connected to the outputs in any way and hence 
cannot influence the output of the program. Mutating these genes therefore has no effect 
on the fitness and represents explicit neutrality. 

2. Active genes have direct influence on the results of the program. Neutral mutations here 
are such modifications that have no influence on the fitness. This implicit neutrality is 
the result of functional redundancy or introns. 

Their experiments indicate that neutrality can increase the chance of success of Genetic 
Programming for needle-in-a-haystack fitness landscapes and in digital circuit evolution 
[2110]. 



Embedded Cartesian Genetic Programming 

In 2005, Walker and Miller [2135] published their work on Embedded Cartesian Genetic 
Programming (ECGP), a new type of CGP with a module acquisition [66] method in the form 
of automatic module creation [2135, 2136, 2137]. For this purpose, three new operations are 
introduced: 

1. Compress randomly selects two points in the genotype and creates a new module 
containing all the nodes between these points. The module then replaces these nodes with 
a cell that invokes it. The compress operator has the effect of shortening the genotype 
of the parent and of making the nodes in the module immune against the standard 
mutation operation but does not affect its fitness. Modules are more or less treated like 
functions, so a cell to which a module number has been assigned now uses that module as 
its "cell function". 

2. Expand randomly selects a module and replaces it with the nodes inside. Only the cell 
which initially replaced the module cells due to the Compress operation can be expanded 
in order to avoid bloat. 

3. The new operator Module Mutation changes modules by adding or removing inputs 
and outputs and may also carry out the traditional one-point mutation on the cells of 
the module. 



General Information 

Areas Of Application 



Some example areas of application of Cartesian Genetic Programming are: 

Application                                   References 

Electrical Engineering and Circuit Design     [1420, 2110, 1421, 1419, 1081, 2136] 
Symbolic Regression and Function Synthesis    [1418, 1422, 2297, 1422] 
Robotics                                      [895, 2138] 
Prime Number Prediction                       [2139] 

42 See Chapter 6 on page 231 for more details on evolutionary programming. 

43 See Section 1.4.5 on page 64 and Section 1.4.5 on page 67 for more information on neutrality and 
redundancy. 



Online Resources 

Some general, online available resources on Cartesian Genetic Programming are: 

http://www.cartesiangp.co.uk/ [accessed 2007-11-02] 
Last update: up-to-date 
Description: The homepage of Cartesian Genetic Programming 

http://www.emoware.org/evolutionary_art.asp [accessed 2007-11-02] 
Last update: 2006 
Description: A website with art pieces evolved using Cartesian Genetic Programming. 



4.8 Epistasis in Genetic Programming 

In the previous sections, we have discussed many different Genetic Programming approaches 
like Standard Genetic Programming and the Grammar-guided Genetic Programming family. 
We also have elaborated on linear Genetic Programming techniques that encode an algorithm 
as a stream of instructions, very much like real programs are represented in the memory of 
a computer. 

When we use such methods to evolve "real algorithms" , we often find that the fitness land- 
scape is very rugged. To a good part, this ruggedness is rooted in epistasis (see Section 1.4.6 
on page 68). In the following section, we want to discuss the different epistatic effects which 
can be observed in Genetic Programming. 

Subsequently, we will introduce some means to mitigate the effects of epistasis. One such 
approach would be to "learn the linkage" (see Section 1.4.6) between the single instructions. 
Module acquisition [66] can be regarded as one idea on how to do this. Generally, the linkage 
between the primitives in GP is far more complicated than in usual genetic algorithm 
problems, which is why the author tends to believe that linkage learning will not achieve the 
same success in GP as it did in the area of genetic algorithms. Therefore, we consider 
methods which focus on the representation of the solution candidates rather than the nature 
of the search operations applied to the genotypes in order to mitigate or circumvent epistasis 
more promising. In Section 4.8.2 to Section 4.8.4, we will discuss three such methods. 



4.8.1 Forms of Epistasis in Genetic Programming 
Semantic Epistasis 

In an algorithm, the behavior of each instruction depends on the operations that have been 
executed before. The result of one instruction will influence the behavior of those executed 
afterwards. If an instruction is changed, if the arithmetic operation a = b + c is swapped 
with a = b - c, for instance, its effects on subsequent instructions will change too [2011]. 
This obvious fact fully complies with the definition of epistasis and we will refer to it as 
semantic epistasis. 



Positional Epistasis 



Epistasis also occurs in the form of positional interdependencies. In order to clarify the role of 
this facet in the context of Genetic Programming, we begin with some basic assumptions. 
Let us consider a program P as some sort of function P : I → O that connects the possible 
inputs I of a system to its possible outputs O. Two programs P_1 and P_2 can be considered 
as equivalent if P_1(i) = P_2(i) ∀ i ∈ I. 44 

For the sake of simplicity, we further define a program as a sequence of n statements 
P = (s_1, s_2, ..., s_n). For these statements, there are n! possible permutations. We argue 
that the fraction θ(P) = m / n! of the m permutations that lead to programs equivalent to P is one 
measure of robustness for a given program representation. More precisely, a low value of θ 
indicates a high degree of positional epistasis, which means that the loci (the positions) of 
many different genes in a genome have influence on their functionality [1502]. This reduces, 
for example, the efficiency of reproduction operations like recombination, since they often 
change the number and order of instructions in a program. 
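
For very small programs, the fraction of permutations that yield an equivalent program (written θ here) can be estimated by brute force, as the Python sketch below shows. The statement encoding, the helper names run and theta, and the two toy programs are assumptions for illustration. Equivalence is only checked on a finite sample of inputs, so the result is an approximation of the measure defined above.

```python
from itertools import permutations
from math import factorial


def run(statements, x):
    """Execute (dest, function) statements on a small variable store."""
    env = {"x": x, "a": 0, "b": 0, "c": 0}
    for dest, function in statements:
        env[dest] = function(env)
    return env["b"]                       # b is taken as the program output


def theta(statements, test_inputs):
    """Fraction of statement orders that behave like the original program."""
    reference = [run(statements, x) for x in test_inputs]
    m = sum(
        1
        for order in permutations(statements)
        if [run(order, x) for x in test_inputs] == reference
    )
    return m / factorial(len(statements))


# The last statement depends on the first two, so it has to stay last.
sequential = [
    ("a", lambda env: env["x"] + 1),
    ("c", lambda env: env["x"] * 3),
    ("b", lambda env: env["a"] + env["c"]),
]
# Three commutative updates of the same variable: the order is irrelevant.
commutative = [
    ("b", lambda env: env["b"] + env["x"]),
    ("b", lambda env: env["b"] + 1),
    ("b", lambda env: env["b"] + 2),
]
theta_sequential = theta(sequential, test_inputs=[1, 2, 3])
theta_commutative = theta(commutative, test_inputs=[1, 2, 3])
```

Only two of the six orders of the first program are equivalent to it, whereas every order of the second one is, which gives values of one third and one, respectively.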
Fig. 4.26.b: in Standard Genetic Programming 
Fig. 4.26.c: in linear Genetic Programming 
Fig. 4.26.d: in genotype-phenotype mappings, like in Grammatical Evolution-like approaches 

Figure 4.26: Positional epistasis in Genetic Programming. 



Many of the phenotypic and most genotypic representations in Genetic Programming 
mentioned so far seem to be rather fragile in terms of insertion and crossover points. One of 

44 In order to cover stateful programs, the input and output sets may also comprise sequences of 
data. 




the causes is that their genomes have high positional epistasis (low θ-measures), as sketched 
in Figure 4.26. 

Embryogenic Epistasis (Problems of String-to-Tree GPMs) 

Many Grammar-guided Genetic Programming methods like Grammatical Evolution 45, 
Christiansen Grammar Evolution 46, and Gads 47 employ a genotype-phenotype mapping 
between an (integer) string genome and trees that represent sentences in a given grammar. 
According to Ryan et al. [1785], the idea of mapping string genotypes can very well be 
compared to one of the natural prototypes of artificial embryogeny 48: the translation of the 
DNA into proteins. This process depends very much on the proteins already produced and 
which are now present around the cellular facilities. If a certain piece of DNA has created a 
protein X and is transcribed again, a molecule of protein type Y may result because of the 
presence of X. 

Although this is a nice analogy, it also bears an important weakness. Search spaces which 
exhibit such effects usually suffer from weak causality 49 [1382] and only the lord knows 
why the DNA stuff works at all. Different from the aforementioned positional epistasis, 
this embryogenic epistasis interacts with the genotype-phenotype mapping and modifies the 
phenotypic outcome. In Grammatical Evolution for example, a change in any gene in a 
genotype g will likely also change the meaning of all alleles following after its locus. This 
means that mutation and crossover will probably have very destructive impacts on the 
individuals [1525]. Hence, even the smallest change in the genotype can modify the whole 
structure and functionality of the phenotype. A valid solution can become infeasible after a 
single reproduction operation. 

Figure 4.27 outlines how the change of a single bit in a genotype (in hexadecimal notation) 
may lead to a drastic modification in the tree structure when a string-to-tree mapping 
is applied. The resulting phenotype in the example has more or less nothing in common 
with its parent except maybe the type of its root node. Furthermore, the efficiency of the 
reproduction operations of the mentioned approaches will likely decrease with a growing set 
of non-terminal symbols and corresponding productions. 
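
This ripple effect can be reproduced with a minimal string-to-tree mapping in Python. The sketch assumes the codon-modulo rule commonly used in Grammatical Evolution-like approaches (the codon value modulo the number of productions of the leftmost non-terminal selects the production); the tiny grammar, the wrapping limit, and the function name ge_map are hypothetical.

```python
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}


def ge_map(codons, start="<expr>", max_wraps=2):
    """Expand the leftmost non-terminal with the rule chosen by the next codon."""
    symbols = [start]
    output = []
    used = 0
    limit = len(codons) * (max_wraps + 1)      # give up after a few wraps
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)              # terminal symbol
            continue
        if used >= limit:
            return None                        # mapping failed
        rules = GRAMMAR[symbol]
        choice = codons[used % len(codons)] % len(rules)
        used += 1
        symbols = list(rules[choice]) + symbols
    return " ".join(output)


parent = [2, 1, 0, 0, 1, 1]
mutant = [3, 1, 0, 0, 1, 1]       # only the lowest bit of the first codon differs
parent_phenotype = ge_map(parent)
mutant_phenotype = ge_map(mutant)
```

In the parent, the second codon selects a production for <expr>; in the mutant, the very same codon is consumed by <var> instead. All following codons change their meaning or are not read at all, so the phenotype shrinks from "x + y" to "y".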

Figure 4.27: Epistasis in a Grammatical Evolution-like approach. 



45 See Section 4.5.6 on page 181, [1785] 

46 See Section 4.5.8 on page 186, [521] 

47 See Section 4.5.5 on page 179, [1620] 

48 Find out more about artificial embryogeny in Section 3.8 on page 155. 

49 The principles of causality and locality are discussed in Section 1.4.3 on page 61. 
The points discussed in this section do by no means indicate that the involved Genetic 
Programming approaches are infeasible or deliver only inferior results. Most of them have 
provided human-competitive solutions or performed even better. We just point out some 
classes of problems that, if successfully solved, could increase the utility of the GGGP 
methods even further. 



4.8.2 Algorithmic Chemistry 

Algorithmic Chemistries were first discussed by Fontana [720] on the basis of the λ-calculus. The 
Algorithmic Chemistry approach by Lasarczyk and Banzhaf [1258] 
represents one possible method to circumvent the positional epistasis discussed in 
Section 4.8.1. To put it bluntly, this form of Algorithmic Chemistries is basically a variant 
of linear Genetic Programming 50 where the execution order of the single instructions is 
defined by some random distribution instead of being fixed as in normal programs. 

This can probably best be described by using a simple example. Therefore, let us define 
the set of basic constructs which will make up the programs first. In both [138] and [1259], 
an assembler-like language is used where each instruction has three parameters: two input 
and one output register addresses. Registers are the basic data stores, the variables, of this 
language. Instructions with a behavior depending on a single input only simply ignore their 
second parameter. In [138], Banzhaf and Lasarczyk use a language comprising the eight 
instructions +, -, /, *, and, or, and not. Furthermore, they provided eleven read-only 
registers with evolved constants and thirty read-write registers, the so-called connection 
registers. 
Fig. 4.28.a: Linear Genetic Programming phenotype and execution. 
Fig. 4.28.b: Algorithmic Chemistry phenotype and execution. 

Figure 4.28: The difference between linear Genetic Programming and Algorithmic 
Chemistries. 



In Fig. 4.28.a, we have illustrated the sequential structure of a normal program which 
might have been evolved in a linear Genetic Programming experiment. Whenever it is exe- 
cuted, be it for determining its fitness or later, as part of an application, the instructions are 



50 The linear Genetic Programming approach is outlined in Section 4.6 on page 191. 






processed one after another, step by step. This execution scheme is common to all off-the- 
shelf PCs where the CPU uses an internal register (the instruction pointer) which points to 
the instruction to be executed next and which is incremented in this process. 

Programs in the Algorithmic Chemistry representation can evolve essentially in the same 
way as programs in linear Genetic Programming do. As genotypes, exactly the same se- 
quences of instructions can be used. This similarity, however, stops at the phenotypic level 51 . 
Here, the programs are considered as multisets which do not define any order on their el- 
ements (the instructions), as sketched in Fig. 4.28.b. When such a program is executed, a 
random sequencer draws one element from this set in each time step and executes it. 52 

This approach clearly leads to a θ-value of one, since all positional dependencies amongst 
the instructions have vanished. As a trade-off, however, there are a number of interesting side 
effects. Since programs no longer are sequences, there is, for example, no last instruction 
anymore. Thus, the Algorithmic Chemistry programs also have no clear "end" . Therefore, 
the randomized execution step is performed for a fixed number of iterations - five times the 
number of instructions in [1258]. As pictured in Fig. 4.28.b, a certain instruction may occur 
multiple times in a program, which increases its probability of being picked for execution. 

The biggest drawback of this approach is that the programs are no longer deterministic 
and their behavior and results may vary between two consecutive executions. Therefore, 
multiple independent runs should always be performed and the median or mean return value 
of them should be considered as the true result. Stripping the instructions of their order also 
will make it harder for higher-level constructs like alternatives or loops to evolve, let alone 
modules or functions. On the positive side, it also creates a large potential for parallelization 
and distribution which could be beneficial especially in multi-processor systems. 
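
The randomized execution scheme and the aggregation over several runs can be sketched in Python as follows. The instruction tuples (op, src1, src2, dest), the two-operation instruction set, and the helper names run_chemistry and median_output are assumptions for illustration; only the iteration budget of five times the number of instructions and the use of the median follow the description above.

```python
import random
import statistics

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def run_chemistry(instructions, registers, rng, factor=5):
    """Draw instructions at random from the multiset and execute them."""
    regs = list(registers)
    for _ in range(factor * len(instructions)):
        op, src1, src2, dest = rng.choice(instructions)
        regs[dest] = OPS[op](regs[src1], regs[src2])
    return regs


def median_output(instructions, registers, output, runs, seed=0):
    """Aggregate several independent executions, since single runs may differ."""
    rng = random.Random(seed)
    results = [
        run_chemistry(instructions, registers, rng)[output] for _ in range(runs)
    ]
    return statistics.median(results)


# The addition occurs twice and is therefore drawn more often.
instructions = [("+", 0, 1, 2), ("+", 0, 1, 2), ("*", 2, 2, 3)]
registers = [2, 3, 0, 0]          # registers 0 and 1 are never written here
result = median_output(instructions, registers, output=3, runs=11)
```

A single execution returns either 25 or 0 in register 3, depending on whether the multiplication happens to be drawn after an addition; taking the median over several runs makes the reported value far less sensitive to such unlucky draws.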

4.8.3 Soft Assignment 

Another approach for reducing the epistasis is the soft assignment method (memory with 
memory) by McPhee and Poli [1385]. It implicitly targets semantic epistasis by weakening 
the way values are assigned to variables. 

In traditional programs, instructions like x=y or mov x, y will completely overwrite the 
value of x with the value of y. McPhee and Poli replace this strict assignment semantic with 

x_{t+1} = γ y_t + (1 - γ) x_t    (4.3) 

where x_{t+1} is the value that the variable x will have after and x_t its value before the 
assignment, and y_t is the value of an arbitrary expression which is to be stored in x. The parameter γ is 
"a constant that indicates the assignment hardness" [1385]. For γ = 1, the assignments are 
completely overwriting as in normal programming and for γ = 0, the values of the variables 
cannot be changed. McPhee and Poli [1385] report that γ = 0.7 performed best (better than 
γ = 1) when applied to different symbolic regression 53 problems. 

For mathematical or approximation problems, this approach is very beneficial. The drawback 
of programs using soft assignment is that, although they are deterministic, there are 
situations where precise values are required which they may not be able to compute. Assume, 
for instance, that an algorithm is to be evolved which returns the largest element max from 
an input list l. This program could contain an instruction like if l[i]>max then max=l[i]. 
If a $\gamma$-value smaller than one is applied, the final value in max will most likely be no element 
of the list. 
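
The effect can be reproduced with a few lines of code. The following sketch applies Equation 4.3 inside the maximum search described above; the function names and the choice to initialize max with the first list element are assumptions made for illustration.

```python
def soft_assign(x, y, gamma):
    """New value of x after softly assigning y to it (Equation 4.3)."""
    return gamma * y + (1.0 - gamma) * x

def soft_max(values, gamma):
    """The 'if l[i] > max then max = l[i]' loop with soft assignment."""
    current = values[0]                  # assumption: start with the first element
    for v in values[1:]:
        if v > current:
            current = soft_assign(current, v, gamma)
    return current

hard = soft_max([1, 5, 3, 9], 1.0)       # ordinary assignment
soft = soft_max([1, 5, 3, 9], 0.7)       # soft assignment with gamma = 0.7
```

With $\gamma = 1$ the loop returns the true maximum 9, whereas with $\gamma = 0.7$ it ends at a blend of the values seen so far that does not occur in the list.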

51 Although we distinguish between genotype and phenotype, no genotype-phenotype mapping is 
needed since the randomized execution can perfectly be performed on an array of instructions. 

52 Banzhaf and Lasarczyk [138] use the same approach for introducing a new recombination operator 
which creates an offspring by drawing instructions randomly from both parents. 

53 See Section 23.1 on page 397 for more information on symbolic regression. 

4.8.4 Rule-based Genetic Programming 

Besides the Algorithmic Chemistry approach of Lasarczyk and Banzhaf [1258, 1259] and soft 
assignments, there exists one very general class of evolutionary algorithms that elegantly 
circumvents positional epistasis 54: the (Learning) Classifier Systems family [948, 946] which 
you can find discussed in Chapter 7 on page 233. 

In the Pittsburgh LCS approach associated with Spears and De Jong [1926], a population 
of rule sets is evolved with a genetic algorithm [1912, 1926]. Each individual in this population 
consists of multiple classifiers (the rules) that transform input signals into output signals. 
The evaluation order of the rules in such a classifier system C plays absolutely no role except 
maybe for rules concerning the same output bits, i.e., $\theta(C) \approx 1$. 

The basic idea behind the Rule-based Genetic Programming approach is to use this 
knowledge to create a new program representation that retains high $\theta$-values in order to 
become more robust in terms of reproduction operations [2181]. With RBGP, the 
aforementioned disadvantages (such as non-determinism) of Algorithmic Chemistries and soft 
assignments are completely circumvented. RBGP may be considered as a high-level LCS 
variant which introduces more powerful concepts like mathematical operations. It furthermore 
exhibits a certain amount of non-uniform neutrality which, as we believe, is likely to increase 
the chance of finding better solutions. 

We illustrate this new Genetic Programming method by using an example in Figure 4.29. 
Like in Pitt-style Learning Classifier Systems, the depicted programs consist of arbitrarily 
many rules which can be binary-encoded. A rule evaluates the values of the symbols in its 
condition part (left of =>) and, in its action part, assigns a new value to one symbol or 
invokes any other procedure provided in its configuration. In its structure, the RBGP 
language is similar to Dijkstra's Guarded Command Language 56 (GCL) [568]. 57 



Genotype and Phenotype 

Before the evolution in Rule-based Genetic Programming begins, the number of symbols 
and their properties must be specified as well as the possible actions. Each symbol identifies 
an integer variable which is either read-only or read-write. Some read-only symbols are 
defined for constants like 0 and 1, for instance. The symbol start is only 1 during the first 
application of the rule set and becomes 0 afterwards (but may be written to by the program). 
Furthermore, a program can be provided with some general-purpose variables (a and b in 
the example). Additional symbols with special meanings can be introduced. For evolving 
distributed algorithms, for instance, an input symbol in where incoming messages will occur 
and a variable out from which outgoing messages are transmitted could be added. 
If messages should be allowed to contain more than one value, multiple such symbols have 
to be defined. These out symbols may trigger message transmission directly when written 
to as in Figure 4.29. Alternatively, a message can be sent by a special send action. 

An action set containing mathematical operations like addition, subtraction, value 
assignment, and an equivalent to logical negation 58 is sufficient for many problems but may 
be extended arbitrarily. In conjunction with the constants 0 and 1 and the comparison 
operation, the evolutionary process can build arbitrarily complex logical expressions. 

From the initial symbol and action specifications, the system can determine how many 
bits are needed to encode a single rule. A binary encoding where this is the size of the genes 
in variable-length bit string genotypes can then be used. With such simple genotypes, any 
possible nesting depth of condition statements and all possible logical operations can be 

54 You can find positional epistasis discussed in Section 4.8.1 on page 202 

55 $\theta$ as a measure for positional epistasis has been defined in Section 4.8.1 on page 202. 

56 http://en.wikipedia.org/wiki/Guarded_Command_Language [accessed 2008-07-24] 

57 I have to thank David R. White for this information - when devising Rule-based Genetic Pro- 
gramming, I didn't even know that the Guarded Command Language existed. 

58 In order to emulate a logical not, we use the expression 1-x where x can be an arbitrary symbol. 

Figure 4.29: Example for a genotype-phenotype mapping in Rule-based Genetic Programming. 



encoded. If needed, a tree-like program structure (as in Standard Genetic Programming) 
can be constructed from the rule sets, since each rule corresponds to a single conditional 
statement in a normal programming language. 

There are similarities between our RBGP and some special types of LCSs, like the ab- 
stracted LCS by Browne and Ioannides [295] and the S-expression-based LCS by Lanzi and 
Perrucci [1250]. The two most fundamental differences lie in the semantics of both, the rules 
and the approach: In RBGP, a rule may directly manipulate symbols and invoke external 
procedures with (at most) two in/out-arguments. This includes mathematical operations 
like multiplication and division which do not exist a priori in LCSs but would have to evolve 
on basis of binary operations, which is, although possible, very unlikely. Furthermore, the 
individuals in RBGP are not classifiers but programs. Classifiers are intended to be executed 
once for a given situation, judge it, and decide upon an optimal output. A program, on the 
other hand, runs independently, asynchronously performs a computation, and interacts with 
its environment. Also, the syntax of RBGP is very extensible because the nature of the sym- 
bols and actions is not bound to specific data types but can easily be adapted to floating 
point computation, for instance. 

Program Execution and Dimensions of Independence 

The simplest method for executing a rule-based program is to loop over it in a cycle. Al- 
though this approach is sufficient for simulation purposes, it would result in a waste of CPU 
power on a real system. This consumption of computational power (and thus, energy) can 
be reduced very much if the conditional parts of the rules are only evaluated when one of 
the symbols that they access changes. 

Positional Independence 

Changes in the values of the symbols can either be caused by data incoming from the outside, 
like messages which are received (and stored in the in-symbols in our example) or by the 
actions invoked by the program itself. In RBGP, actions do not directly modify the values 
of the symbols but rather write their results to a temporary storage. After all actions have 
been processed, the temporary storage is committed to the real memory, as sketched in 
Figure 4.30. The symbols in the condition part and in the computation parts of the actions 
are annotated with the index t and those in the assignment part of the actions are marked 
with t + 1 in order to illustrate this issue. 

Figure 4.30: A program computing the factorial b of a natural number a in Java and RBGP 
syntax. 



This approach allows for a great amount of disarray in the rules since the only possible 
positional dependencies left are those of rules which write to the same symbols. All other 
rules can be freely permuted without any influence on the behavior of the program. Hence, 
the positional epistasis in RBGP is very low. 
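
The deferred commit can be illustrated with a small sketch. The representation of a rule as a triple (condition, target symbol, action) and the two swap rules are assumptions chosen for this illustration; they are not the binary encoding used by RBGP.

```python
from itertools import permutations

def step(rules, state):
    """Apply every rule to the old state, then commit all writes at once."""
    pending = {}                          # temporary storage for time t+1
    for condition, target, action in rules:
        if condition(state):              # conditions and actions read values of time t
            pending[target] = action(state)
    new_state = dict(state)
    new_state.update(pending)             # commit after all rules were processed
    return new_state

# Two hypothetical rules that exchange a and b without a helper variable.
SWAP = [
    (lambda s: True, "a", lambda s: s["b"]),
    (lambda s: True, "b", lambda s: s["a"]),
]

results = [step(list(order), {"a": 1, "b": 2}) for order in permutations(SWAP)]
```

Both orderings of the two rules yield the same new state, since neither rule sees the value written by the other before the commit.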

Cardinality Independence 

By excluding any explicit learning features (like the Bucket Brigade Algorithm 59 used in 
Learning Classifier Systems [942, 943]), we also gain insensitivity in terms of rule cardinality. 
It is irrelevant whether a rule occurs once, twice, or even more often in a program. If triggered, 
all occurrences of the rule use the same input data and thus will write the same values to the 
temporary variable representing their target symbol. Assuming that an additional objective 
function which puts pressure into the direction of smaller programs is always imposed, 
superfluous rules will be wiped out during the course of the evolution anyway. 

Neutrality 

The existence of neutral reproduction operations can have a positive as well as negative 
influence on the evolutionary progress (see Section 1.4.5 on page 64). The positional and 
cardinality independence are clear examples of phenotypic neutrality and redundancy in 
RBGP. They allow a useful rule to be replicated arbitrarily often in the same program 
without decreasing its functional fitness. This is likely to happen during crossover. By doing 
so, the conditional parts of the rule will (obviously) be copied too. Subsequent mutation 
operations may now slightly modify the rule and lead to improved behavior, i. e., act as 



59 The Bucket Brigade Algorithm is discussed in Section 7.3.8 on page 240. 

exploitation operations. Based on the discussion of neutrality, we expect this form of 
non-uniform redundancy to have a rather positive effect on the evolution. 

All these new degrees of freedom are achieved without most of the drawbacks that are 
inherent in Algorithmic Chemistries. The program flow is fully deterministic and so are its 
results. Like in Algorithmic Chemistries, it is harder to determine the number of steps needed 
for program execution, although we can easily detect the termination of local algorithms as 
the point where an execution does not lead to any change in the symbols. 

Complex Statements 

From the previous descriptions, it would seem that rule-based programs are strictly sequen- 
tial, without branching or loops. This is not the case. Instead, a wide variety of complex 
computations can be expressed with them. Here we will give some intuitive examples for 
such program structures in RBGP syntax. 

Complex Conditions 

Assume that we have five variables a to e and want to express something like 

if( (a<b) && (c>d) && (a<d) ) { 
  a += c; 
  c--; 
} 

Listing 4.9: A complex conditional statement in a C-like language. 

We can do this in RBGP with four rules: 

true ∧ true                 => e_{t+1} = 0 
(a_t < b_t) ∧ (c_t > d_t)   => e_{t+1} = 1 
(a_t < d_t) ∧ (e_t = 1)     => a_{t+1} = a_t + c_t 
(a_t < d_t) ∧ (e_t = 1)     => c_{t+1} = c_t - 1 

Listing 4.10: The RBGP version of Listing 4.9. 

Although this does not look very elegant, it fulfills the task by storing the result of the 
evaluation of the condition as logical value in the variable e. e will normally be 0 because of 
line 1 and is only set to 1 by rule 2. Since both rules write to the same temporary variable 
and rule 2 is applied after rule 1, the value written by rule 2 prevails whenever its condition 
holds. The then-part of the condition in Listing 4.9 (lines 3 and 4 in Listing 4.10) will then 
be reached in the next round if a<d holds too. Notice that the only positional dependency in 
Listing 4.10 is that rule 2 must always occur after rule 1. All rule permutations that obey 
this statement are equivalent (hence, $\theta = 0.5$) and so is Listing 4.11: 

Listing 4.11: An equivalent alternative version of Listing 4.10. 
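
The role of the rule order can be checked by simulating the four rules of Listing 4.10. The sketch below is an illustration under assumptions: rules are modeled as Python triples (condition, target, action), a later rule overwrites the pending value of an earlier one, and the start values and the number of rounds are chosen freely.

```python
from itertools import permutations

def run(rules, state, rounds=4):
    """Cycle over the rules; writes are committed at the end of each round."""
    state = dict(state)
    for _ in range(rounds):
        pending = {}
        for cond, target, action in rules:
            if cond(state):
                pending[target] = action(state)   # a later rule overwrites an earlier one
        state.update(pending)
    return state

RULES = [
    (lambda s: True, "e", lambda s: 0),
    (lambda s: s["a"] < s["b"] and s["c"] > s["d"], "e", lambda s: 1),
    (lambda s: s["a"] < s["d"] and s["e"] == 1, "a", lambda s: s["a"] + s["c"]),
    (lambda s: s["a"] < s["d"] and s["e"] == 1, "c", lambda s: s["c"] - 1),
]
START = {"a": 1, "b": 2, "c": 5, "d": 3, "e": 0}

reference = run(RULES, START)
# All orderings (as index tuples) that behave like the original listing.
same = [p for p in permutations(range(4))
        if run([RULES[i] for i in p], START) == reference]
```

For these start values, the original order produces a = 6 and c = 4, as the C-like version would. Of the 24 possible orderings, exactly the 12 in which rule 1 precedes rule 2 give the same result, which matches the value of 0.5 stated above.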

Loops 

Loops in RBGP can be created in the very same fashion. 

b = 1; 
for(a=c; a>0; a--) { 
  b *= a; 
} 

Listing 4.12: A loop in a C-like language. 

The loop defined in Listing 4.12 can be expressed in RBGP as outlined in Listing 4.13, 
where we use the start-symbol (line 1 and 2) to initialize a and b. As its name suggests, 
start is only 1 at the very beginning of the program's execution and 0 afterwards (unless 
modified by an action). 

(start_t > 0) ∨ false   => a_{t+1} = c_t 
true ∧ (start_t > 0)    => b_{t+1} = 1 
(a_t > 0) ∧ true        => a_{t+1} = a_t - 1 
false ∨ (a_t > 0)       => b_{t+1} = b_t * a_t 

Listing 4.13: The RBGP-version of Listing 4.12. 

Here, no positional or cardinality restrictions occur at all, so Listing 4.14 is equivalent 
to Listing 4.13 and $\theta = 1$. 

Listing 4.14: An equivalent, alternative version of Listing 4.13. 
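
The following sketch executes the four rules of Listing 4.13 by cycling over them until a round changes no symbol. The rule triples, the handling of start by the interpreter, and the assumption that a and b are initially 0 are choices made for this illustration.

```python
from itertools import permutations

def run(rules, state, max_rounds=100):
    """Cycle over the rules until a round changes no symbol."""
    state = dict(state)
    for _ in range(max_rounds):
        pending = {}
        for cond, target, action in rules:
            if cond(state):
                pending[target] = action(state)
        new_state = {**state, **pending}
        new_state["start"] = 0            # start is 1 only in the first round
        if new_state == state:            # nothing changed: the program has terminated
            break
        state = new_state
    return state

RULES = [
    (lambda s: s["start"] > 0, "a", lambda s: s["c"]),
    (lambda s: s["start"] > 0, "b", lambda s: 1),
    (lambda s: s["a"] > 0, "a", lambda s: s["a"] - 1),
    (lambda s: s["a"] > 0, "b", lambda s: s["b"] * s["a"]),
]

def factorial_of(c, rules=RULES):
    """Run the rule set with input c; a and b are assumed to start at 0."""
    return run(rules, {"start": 1, "a": 0, "b": 0, "c": c})["b"]

def order_independent(c):
    """True if every ordering of the four rules yields the same result."""
    return len({factorial_of(c, list(p)) for p in permutations(RULES)}) == 1
```

Calling factorial_of(5) yields 120. The result stays the same for every ordering of the rules and also if each rule is contained twice (RULES + RULES), which illustrates both the positional and the cardinality independence for this program.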



Extended Rule-based Genetic Programming 

We have shown that, although looking rather simple, the primitives of Rule-based Genetic 
Programming are mighty enough to express many of the constructs known from high-level 
programming languages. However, in the original RBGP approach, there are some inherent 
limitations. 

Its most obvious drawback is the lack of Turing completeness. In order to visualize this 
problem, imagine that the Java programming language were restricted to simple types like 
integer variables and parameters. Then, no complex types like arrays could be used. In this 
case, it would become hard to create programs which process data structures like lists, since 
a separate variable would have to be defined and accessed for each of their elements. Writing 
a method for sorting a list of arbitrary length would even become impossible. 

The same restrictions hold in Rule-based Genetic Programming as introduced in 
Section 4.8.4: the symbols there resemble plain integer variables. In Java, the problems stated 
above are circumvented with arrays, a form of memory which can be accessed indirectly. 
Adding indirect memory access to the programming language forming the basis of 
Rule-based Genetic Programming would allow the evolution of more complex algorithms. As a 
matter of fact, this is the standard approach for creating Turing-complete representations. We 
therefore define the notation $[a_t]_t$, which stands for the value of the $(a_t)$th symbol at time 
step t in the ordered list of all symbols. In this, it equals a simple pointer dereference 
(*a) in the C language. 
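
A minimal sketch of this indirect access is given below. The concrete symbol list with three memory cells and the helper names deref and store are assumptions made for illustration only.

```python
# Ordered list of all symbols; a hypothetical configuration with three memory cells.
SYMBOLS = ["i0", "i1", "i2", "l", "a", "b"]

def deref(state, index_symbol):
    """Return the value of the symbol whose position is stored in index_symbol."""
    position = state[index_symbol]
    return state[SYMBOLS[position]]

def store(state, index_symbol, value):
    """Return a new state where the indirectly addressed symbol holds value."""
    new_state = dict(state)
    new_state[SYMBOLS[state[index_symbol]]] = value
    return new_state

state = {"i0": 7, "i1": 3, "i2": 9, "l": 3, "a": 2, "b": 0}
value_at_a = deref(state, "a")        # a holds 2, so this reads i2
updated = store(state, "b", 5)        # b holds 0, so this writes to i0
```

Since the index is itself the value of a symbol, a program can walk through the memory cells by incrementing that symbol, which is what a fixed set of directly addressed variables cannot provide.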

With this extension alone, it becomes possible to use the RBGP language for defining list 
sorting algorithms, for instance. Assume that the following symbols $(i_0, i_1, \ldots, i_{n-1}, l, a, b)$ 
have been defined. The symbols $i_0$ to $i_{n-1}$ constitute the memory which can be used to 
store the list elements and l is initialized with the length of the list, i.e., the number of the 
i-elements actually used (which has to be smaller than or equal to n). a and b are multi-purpose 
variables. In the symbol list, $i_0$ is at position 0, l at position n, a at index n + 1, and so 
on. With very little effort, Listing 4.15 can be defined which performs a variant of selection 
sort 60. Notice that, since writes to variables are not committed before all rules have been 
applied, no explicit temporary variable is required in the third and fourth rule. 



60 http://en.wikipedia.org/wiki/Selection_sort [accessed 2008-05-09] 

Listing 4.15: A simple selection sort algorithm written in the eRBGP language. 

Listing 4.10, one of our previous examples, shows another feature of RBGP which might 
prove troublesome: The condition part of a rule always consists of two single conditions. 
This is totally unimportant as long as a logical expression to be represented contains only 
one or two conditions. (If it needs only one, the second condition of the rule may be set to 
true and concatenated with an ∧-operator.) In Listing 4.10, however, we try to represent 
the conditional statement from Listing 4.9 which consists of three conditions. In order to do 
so, we needed to introduce the additional symbol e. 

Here we can draw an analogy to the human memory 61 which may be divided into pro- 
cedural 62 (implicit 63 ) memory [848, 1715, 1818, 2229] storing, for instance, motor skills and 
declarative 64 (explicit 65 ) memory [1145, 1155] holding facts and data. In comparison with 
RBGP, we would find that the expressiveness of the equivalent of the procedural memory 
in RBGP is rather limited, which needs to be mitigated by using more of it and storing 
additional information in the declarative memory. We used this approach when translating 
Listing 4.9 to Listing 4.10, for instance. This issue can be compared to a hypothetical situ- 
ation in which we were not able to learn the complete motion of lifting a jar to our lips and 
instead, could only learn how to lift a jar from the table and how to move an already lifted 
jar to our mouth while needing to explicitly remember that both moves belong together. 

Admittedly, this analogy may be a bit far-fetched, but it illustrates that Rule-based 
Genetic Programming could be forced to go through a seemingly complex learning process for 
building a simple algorithm under some circumstances. We therefore extend its expressiveness 
by dropping the constraints on the structure of its rules, which increases the number 
of ways that RBGP can be utilized for representing complicated expressions. The ability 
of using genetic algorithms with fixed-size genes for evolving rule-based programs, however, 
has to be given up in order to facilitate this extension. Additionally, this extension might 
bring back some of the epistasis which we had previously successfully decreased. 

((a_t < b_t) ∧ ((c_t > d_t) ∧ (a_t < d_t))) => a_{t+1} = (a_t + c_t) 
((a_t < b_t) ∧ ((c_t > d_t) ∧ (a_t < d_t))) => c_{t+1} = (c_t - 1) 

Listing 4.16: The eRBGP version of Listing 4.9 and Listing 4.10. 

(start_t ≠ 0) => b_{t+1} = 1 
(start_t ≠ 0) => a_{t+1} = c_t 
(a_t > 0)     => a_{t+1} = (a_t - 1) 
(a_t > 0)     => b_{t+1} = (b_t * a_t) 

Listing 4.17: The eRBGP version of Listing 4.12 and Listing 4.13. 

In Listing 4.16 and Listing 4.17, we repeat the RBGP examples Listing 4.10 and 
Listing 4.13, this time in eRBGP syntax. As already mentioned, we cannot use a simple 
genetic algorithm to evolve these programs since their structure does not map to a fixed 
gene size anymore. However, tree-based Standard Genetic Programming as discussed in 
Section 4.3 can perfectly fulfill this purpose. Listing 4.16, for instance, corresponds to the tree 
phenotype depicted in Figure 4.31. 

61 http://en.wikipedia.org/wiki/Memory [accessed 2008-05-08] 

62 http://en.wikipedia.org/wiki/Procedural_memory [accessed 2008-05-08] 

63 http://en.wikipedia.org/wiki/Implicit_memory [accessed 2008-05-08] 

64 http://en.wikipedia.org/wiki/Declarative_memory [accessed 2008-05-08] 

65 http://en.wikipedia.org/wiki/Explicit_memory [accessed 2008-05-08] 

Figure 4.31: The tree phenotype (and genotype) of Listing 4.16. 
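
A tree representation of this kind can be sketched with nested tuples. The encoding (operator, left, right), the operator table, and the function names are assumptions made for this illustration and do not reproduce the node types of Figure 4.31.

```python
import operator

# Hypothetical tree encoding: (operator, left subtree, right subtree).
OPS = {
    "and": lambda x, y: bool(x) and bool(y),
    "<": operator.lt, ">": operator.gt,
    "+": operator.add, "-": operator.sub,
}

CONDITION = ("and", ("<", "a", "b"), ("and", (">", "c", "d"), ("<", "a", "d")))
RULES = [
    (CONDITION, "a", ("+", "a", "c")),
    (CONDITION, "c", ("-", "c", 1)),
]

def evaluate(node, state):
    """Recursively evaluate an expression tree against the symbol values."""
    if isinstance(node, str):
        return state[node]               # leaf: symbol
    if isinstance(node, int):
        return node                      # leaf: constant
    op, left, right = node
    return OPS[op](evaluate(left, state), evaluate(right, state))

def step(rules, state):
    """Evaluate all rules on the old state and commit the writes together."""
    pending = {target: evaluate(expr, state)
               for cond, target, expr in rules if evaluate(cond, state)}
    return {**state, **pending}

after = step(RULES, {"a": 1, "b": 2, "c": 5, "d": 3})
```

Since the three conditions are nested in one tree, the helper symbol e of Listing 4.10 is not needed, and the then-part is executed in the same round in which the condition holds.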



With these changes, Extended Rule-based Genetic Programming becomes much more 
powerful in comparison with plain Rule-based Genetic Programming and is now able to 
evolve arbitrary algorithms and data structures. Also, the proof for Turing completeness of 
Genetic Programming languages with indexed memory by Teller [2012] can easily be adapted 
to Extended Rule-based Genetic Programming (as well as the simpler strategy by Nordin 
and Banzhaf [1543]). 



4.9 Artificial Life and Artificial Chemistry 

It is not hard to imagine what artificial life is. As a matter of fact, I assume that every one of 
us has already seen numerous visualizations and simulations showing artificial creatures. Even 
some examples from this book like the Artificial Ant may well be counted in that category. 

Definition 4.4 (Artificial Life). Artificial life 66 , also abbreviated with ALife or AL, is 
a field of research that studies the general properties of life by synthesizing and analyzing 
life-like behavior [165]. 

Definition 4.5 (Artificial Chemistry). The area of artificial chemistries 67 subsumes all 
computational systems which are based on simulations of entities similar to molecules and 
the reactions amongst them. 

According to Dittrich et al. [575], an artificial chemistry is defined by a triple (S, R, A), 
where $S = \{s_1, s_2, s_3, \ldots\}$ is the set of possible molecules, R is the set of reactions that 
can occur between them, and A is an algorithm defining how the reactions are applied. 

Artificial chemistry and artificial life strongly influence each other and often merge into 
each other. The work of Hutton [977, 978, 979], for example, focuses on generating and 
evolving self-replicating molecules and cells. There also exist real-world applications of 
artificial chemistry in many areas of chemistry, computer networking, economics, and sociology. 
The Algorithmic Chemistries, which we have analyzed in Section 4.8.2 on page 205, are also 
closely related to artificial chemistries, for instance. In this section, we will discuss some more 
artificial life and artificial chemistry approaches in the context of Genetic Programming. 

66 http://en.wikipedia.org/wiki/Artificial_life [accessed 2007-12-13] 

67 http://en.wikipedia.org/wiki/Artificial_chemistry [accessed 2008-05-01] 

4.9.1 Push, PushGP, and Pushpop 



In 1996, early research in self-modification or self-evolution of programs was conducted 
by Spector and Stoffel [1935] in form of the ontogenic extension of their HiGP system [1965]. 
Basically, they extended the programs evolved with a linear Genetic Programming method 
with the capabilities of shifting and copying segments of their code at runtime. 

About half a decade later, Spector [1930] developed Push, a stack-based programming 
language especially suitable for evolutionary computation [1930, 1934, 1932]. Programs in 
that language can be evolved by adapting existing Standard Genetic Programming systems 
(as done in PushGP) or, more interestingly, by themselves in an autoconstructive manner, 
which has been realized in the Pushpop system. The Push language is currently 
available in its third release, Push3 [1938, 1939]. 

A Push program is either a single instruction, a literal, or a sequence of zero or more 
Push programs inside parentheses. 

program ::= instruction | literal | ( {program} ) 

An instruction may take zero or more arguments from the stack. If not enough 
arguments are available, it acts as NOOP, i.e., does nothing. The same goes if the arguments 
are invalid, like when a division by zero would occur. 

In Push, there is a stack for each data type, including integers, Boolean values, floats, 
name literals, and code itself. The instructions are usually named according to the scheme 
<type>.<operation>, like INTEGER.+, BOOLEAN.DUP, and so on. One simple example for a Push 
program borrowed from Spector [1932], Spector et al. [1938] is 

( 5 1.23 INTEGER.+ ( 4 ) INTEGER.- 5.67 FLOAT.* ) 

which will leave the stacks in the following states: 

FLOAT STACK  : (6.9741) 
CODE STACK   : ( ( 5 1.23 INTEGER.+ ( 4 ) INTEGER.- 5.67 FLOAT.* ) ) 
INTEGER STACK: (1) 

Listing 4.18: A first, simple example for a Push program. 

Since all operations take their arguments from the corresponding stacks, the initial 
INTEGER.+ does nothing because only one integer, 5, is available on the INTEGER stack. 
INTEGER.- subtracts the value on the top of the INTEGER stack (4) from the one beneath it (5) 
and leaves the result (1) there. On the FLOAT stack, the result of the multiplication FLOAT.* of 
1.23 and 5.67 is left while the whole program itself resides on the CODE stack. 
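
The behavior described above can be reproduced with a very small interpreter. This is a sketch under assumptions and not the reference implementation of Push: programs are modeled as nested Python lists, only the three instructions of Listing 4.18 are supported, and the function name run_push is invented for this illustration.

```python
def run_push(program):
    """Execute a nested-list Push program and return the typed stacks."""
    stacks = {"INTEGER": [], "FLOAT": [], "CODE": [program]}

    def binary(kind, fn):
        stack = stacks[kind]
        if len(stack) < 2:                 # too few arguments: act as NOOP
            return
        top, below = stack.pop(), stack.pop()
        stack.append(fn(below, top))

    def execute(item):
        if isinstance(item, list):         # sub-program: run its parts in order
            for part in item:
                execute(part)
        elif isinstance(item, int):        # literals go to the stack of their type
            stacks["INTEGER"].append(item)
        elif isinstance(item, float):
            stacks["FLOAT"].append(item)
        elif item == "INTEGER.+":
            binary("INTEGER", lambda x, y: x + y)
        elif item == "INTEGER.-":
            binary("INTEGER", lambda x, y: x - y)
        elif item == "FLOAT.*":
            binary("FLOAT", lambda x, y: x * y)

    execute(program)
    return stacks

LISTING_4_18 = [5, 1.23, "INTEGER.+", [4], "INTEGER.-", 5.67, "FLOAT.*"]
stacks = run_push(LISTING_4_18)
```

The separate stacks are what makes the program robust: the float literal 1.23 between 5 and INTEGER.+ does not disturb the integer operations, it merely waits on its own stack.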

Code Manipulation 

One of the most interesting features of Push is that we can easily express new forms of 
control flow or self-modifying code with it. Here, the CODE stack and, since Push3, the EXEC 
stack play an important role. Let us take the following example from [1930, 1938]: 

(CODE.QUOTE (2 3 INTEGER.+) CODE.DO) 

Listing 4.19: An example for the usage of the CODE stack. 

The instruction CODE.QUOTE leads to the next piece of code ((2 3 INTEGER.+) in this case) 
being pushed onto the CODE stack. CODE.DO then invokes the interpreter on whatever is on 
the top of this stack. Hence, 2 and 3 will land on the INTEGER stack as arguments for the 
following addition. In other words, Listing 4.19 is just a complicated way to add 2 + 3 = 5. 

(CODE.QUOTE 
  (CODE.QUOTE (INTEGER.POP 1) 
   CODE.QUOTE (CODE.DUP INTEGER.DUP 1 INTEGER.- CODE.DO INTEGER.*) 
   INTEGER.DUP 2 INTEGER.< CODE.IF) 
 CODE.DO) 

Listing 4.20: Another example for the usage of the CODE stack. 

Listing 4.20 outlines a Push program using a similar mechanism to compute the factorial 
of an input provided on the INTEGER stack. It first places the whole program on the CODE 
stack and executes it (with the CODE.DO at its end). This in turn leads to the code in lines 2 
and 3 being placed on the CODE stack. The INTEGER.DUP instruction now duplicates the top 
of the INTEGER stack. Then, 2 is pushed and the following INTEGER.< performs a comparison 
of the top two elements on the INTEGER stack, leaving the result (true or false) on the 
BOOLEAN stack. The instruction CODE.IF executes one of the top two items of the CODE stack, 
depending on the value it finds on the BOOLEAN stack, and removes all three elements. So 
in case that the input element was smaller than 2, the top element of the INTEGER stack 
will be removed and 1 will be pushed into its place. Otherwise, the next instruction 
CODE.DUP duplicates the whole program on the CODE stack (remember that everything else has 
already been removed from the stack when CODE.IF was executed). INTEGER.DUP copies the 
top of the INTEGER stack, 1 is pushed and then subtracted from this duplicate. CODE.DO then 
executes the duplicated program on this decremented value. The result is 
then multiplied with the original value, leaving the product on the stack. So, Listing 4.20 
realizes a recursive method to compute the factorial of a given number. 
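
The recursion can be traced with a minimal interpreter for the instructions used in Listing 4.19 and Listing 4.20. The sketch rests on assumptions that follow the description above rather than a language specification: CODE.DO leaves the executed code on the CODE stack while it runs and pops the stack afterwards, and CODE.IF removes both code items before executing the second one (if true) or the top one (if false). The nested-list program format and the name run_push are likewise invented for this illustration.

```python
def run_push(program, integers=()):
    """Minimal interpreter for the instructions of Listings 4.19 and 4.20."""
    ints, bools, code = list(integers), [], []

    def execute(item):
        if isinstance(item, list):
            parts = iter(item)
            for part in parts:
                if part == "CODE.QUOTE":
                    quoted = next(parts, None)
                    if quoted is not None:      # push the next piece of code unevaluated
                        code.append(quoted)
                else:
                    execute(part)
        elif isinstance(item, int):
            ints.append(item)
        elif item == "CODE.DO" and code:
            execute(code[-1])                   # code stays on the stack while it runs
            if code:
                code.pop()
        elif item == "CODE.DUP" and code:
            code.append(code[-1])
        elif item == "CODE.IF" and len(code) >= 2 and bools:
            otherwise, then = code.pop(), code.pop()
            execute(then if bools.pop() else otherwise)
        elif item == "INTEGER.DUP" and ints:
            ints.append(ints[-1])
        elif item == "INTEGER.POP" and ints:
            ints.pop()
        elif item == "INTEGER.<" and len(ints) >= 2:
            top, below = ints.pop(), ints.pop()
            bools.append(below < top)
        elif item in ("INTEGER.+", "INTEGER.-", "INTEGER.*") and len(ints) >= 2:
            top, below = ints.pop(), ints.pop()
            ints.append({"+": below + top, "-": below - top, "*": below * top}[item[-1]])
        # anything else, including instructions lacking arguments, acts as NOOP

    execute(program)
    return ints

ADD = ["CODE.QUOTE", [2, 3, "INTEGER.+"], "CODE.DO"]
FACTORIAL = ["CODE.QUOTE",
             ["CODE.QUOTE", ["INTEGER.POP", 1],
              "CODE.QUOTE", ["CODE.DUP", "INTEGER.DUP", 1, "INTEGER.-",
                             "CODE.DO", "INTEGER.*"],
              "INTEGER.DUP", 2, "INTEGER.<", "CODE.IF"],
             "CODE.DO"]
```

Under these assumptions, run_push(ADD) leaves 5 and run_push(FACTORIAL, [5]) leaves 120 on the INTEGER stack.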

Name Binding 

As previously mentioned, there is also a NAME stack in the Push language. It enables us 
to bind arbitrary constructs to names, allowing for the creation of named procedures and 
variables. 

( DOUBLE CODE.QUOTE ( INTEGER.DUP INTEGER.+ ) CODE.DEFINE ) 

Listing 4.21: An example for the creation of procedures. 

In Listing 4.21, we first define the literal DOUBLE which will be pushed onto the NAME stack. 
This definition is followed by the instruction CODE.QUOTE, which will place code for adding 
an integer number to itself on the CODE stack. This code is then assigned to the name on top 
of the NAME stack (DOUBLE in our case) by CODE.DEFINE. From there on, DOUBLE can be used as 
a procedure. 

The EXEC Stack 

Many control flow constructs of Push programs up to version 2 of the language are executed 
by similar statements in the interpreter. Beginning with Push3, all instructions are pushed 
onto the new EXEC stack prior to their invocation. Now, no additional state information or 
flags are required in the interpreter apart from the stacks and name bindings. Furthermore, 
the EXEC stack supports manipulation mechanisms similar to those of the CODE stack. 

( DOUBLE EXEC.DEFINE ( INTEGER.DUP INTEGER.+ ) ) 

Listing 4.22: An example for the creation of procedures similar to Listing 4.21. 

The EXEC stack is very similar to the CODE stack, except that its elements are pushed in 
the inverse order. The program in Listing 4.22 is similar to Listing 4.21 [1938]. 

Autoconstructive Evolution 

Push3 programs can be considered as tree structures and hence be evolved using standard 
Genetic Programming. This approach has been exercised with the PushGP system [1930, 
1934, 1933, 463, 1745]. However, the programs can also be equipped with the means to create 
their own offspring. This idea has been realized in a software called Pushpop [1930, 1934, 
1931]. In Pushpop, whatever is left on top of the CODE stack after a program's execution is 
regarded as its child. Programs may use the above-mentioned code manipulation facilities 
to create their descendants and can also access a variety of additional functions, like 

1. CODE.RAND pushes newly created random code onto the CODE stack. 

2. NEIGHBOR takes an integer n and returns the code of the individual in distance n. The 
population is defined as a linear list where siblings are grouped together. 

3. ELDER performs a tournament between n individuals of the previous generation and 
returns the winner. 

4. OTHER performs a tournament between n individuals of the current generation, comparing 
individuals according to their parents' fitness, and returns the winner. 

After the first individuals able to reproduce have been evolved, the system can be used 
to derive programs solving a given problem. The only external influence on the system is a 
selection mechanism required to prevent uncontrolled growth of the population by allowing 
only the children of fit parents to survive. 

4.9.2 Fraglets 

In his seminal work, Tschudin [2058] introduced Fraglets 68 , a new artificial chemistry suitable 
for the development and even evolution of network protocols. Fraglets represent an execution 
model for communication protocols which resembles chemical reactions. 

How do Fraglets work? 

From the theoretical point of view, the Fraglet approach is an instance of Post's string 
rewriting systems 69 [1672] and Gamma systems [127, 131, 128, 129, 130]. Fraglets are symbolic 
strings of the form $[s_1 : s_2 : \ldots : s_n]$. The symbols $s_i$ either represent control information 
or payload. Each node in the network has a Fraglet store which corresponds to a reaction 
vessel in chemistry. Such vessels usually contain multiple copies of the same molecule, and the 
same goes for Fraglet stores, which can be implemented as multisets keeping track of the 
multiplicity of the Fraglets they contain. 

Tschudin [2058] defines a simple prefix programming language with a fixed instruction 
set comprising transformation and reaction rules for Fraglets. Transformations like dup and 
nop modify a single Fraglet whereas reactions such as match and matchP combine two Fraglets. 
For the definition of these rules in Table 4.2, we will use the syntax $[s_1 : s_2 : \ldots : tail]$ 
where $s_i$ is a symbol and tail is a possibly empty sequence of symbols. 70 

Obviously, the structure of Fraglets is very different from other program representations. 
There are no distinguishable modules or functions, no control flow statements such as jumps 
or function invocations, and no distinction exists between memory and code. Nevertheless, 
the Fraglet system is powerful, has a good expressiveness, and there are indications that it 
is likely Turing-complete [571]. 

Examples 

After defining the basics of the Fraglet approach, let us now take a look at a few simple 
examples. 

Election 

Election in a distributed system means to select one node in the network and to ensure 
that all nodes receive knowledge of the ID of the selected one. One way to perform such an 
election is to determine the maximum ID of all nodes, which is what we will do here. 

68 See http://en.wikipedia.org/wiki/Fraglets [accessed 2008-05-02] and http://www.fraglets.net/ 
[accessed 2008-05-02] for more information. 

69 http://en.wikipedia.org/wiki/Post_canonical_system [accessed 2008-05-02] 

70 See http://www.fraglets.net/frag-instrset-20070924.txt [accessed 2008-05-02] for the full in- 
struction set as of 2007-09-24. 
tag         transformation/reaction

► basic transformations

dup         [dup : t : a : tail] → [t : a : a : tail]
            duplicate a single symbol
exch        [exch : t : a : b : tail] → [t : b : a : tail]
            swap two tags
fork        [fork : a : b : tail] → [a : tail], [b : tail]
            copy Fraglet and prepend different header symbols
nop         [nop : tail] → [tail]
            does nothing (except consuming the instruction tag)
null        [null : tail] → (nothing)
            destroy a Fraglet
pop2        [pop2 : h : t : a : tail] → [h : a], [t : tail]
            pop head element a out of a list [a : b : tail]
split       [split : tail1 : * : tail2] → [tail1], [tail2]
            break a Fraglet into two at the first occurrence of *

► arithmetic transformations

sum         [sum : t : {m} : {n} : tail] → [t : {m + n} : tail]
            an operation adding two numbers
lt          [lt : yes : no : {a} : {b} : tail] → [yes : {a} : {b} : tail] if a < b,
                                                 [no : {a} : {b} : tail] otherwise
            a logic operation comparing two numbers a and b

► communication primitives

broadcast   [broadcast : tail] → n[tail] for all nodes n in N
            broadcast tail to all nodes n in the network N
send        [send : dest : tail] → dest[tail]
            send tail to a single node dest
node        n[node : t : tail] → n[t : {id(n)} : tail]
            obtain the ID id(n) of the current node n

► reactions

match       [match : a : tail1], [a : tail2] → [tail1 : tail2]
            two Fraglets react, their tails are concatenated
matchP      [matchP : a : tail1], [a : tail2] → [matchP : a : tail1], [tail1 : tail2]
            "catalytic match", i. e., the matchP rule persists

Table 4.2: Some Fraglet instructions (from [2058] and http://www.fraglets.net/ (2008-05-02)).
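
The basic transformations of Table 4.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tschudin's implementation: Fraglets are modelled as tuples of strings, and the function name `transform` is hypothetical.

```python
def transform(fraglet):
    """Apply one basic transformation from Table 4.2 to a single Fraglet.

    Assumption of this sketch: a Fraglet is a tuple of symbols.
    Returns the list of resulting Fraglets. A Fraglet whose head is not a
    transformation tag (or which is too short for its rule) is returned unchanged.
    """
    head, rest = fraglet[0], fraglet[1:]
    if head == "dup" and len(rest) >= 2:      # [dup : t : a : tail] -> [t : a : a : tail]
        return [(rest[0], rest[1], rest[1]) + rest[2:]]
    if head == "exch" and len(rest) >= 3:     # [exch : t : a : b : tail] -> [t : b : a : tail]
        return [(rest[0], rest[2], rest[1]) + rest[3:]]
    if head == "fork" and len(rest) >= 2:     # [fork : a : b : tail] -> [a : tail], [b : tail]
        return [(rest[0],) + rest[2:], (rest[1],) + rest[2:]]
    if head == "nop":                         # [nop : tail] -> [tail]
        return [rest]
    if head == "null":                        # [null : tail] -> nothing
        return []
    if head == "pop2" and len(rest) >= 3:     # [pop2 : h : t : a : tail] -> [h : a], [t : tail]
        return [(rest[0], rest[2]), (rest[1],) + rest[3:]]
    if head == "split" and "*" in rest:       # break at the first occurrence of *
        i = rest.index("*")
        return [rest[:i], rest[i + 1:]]
    return [fraglet]

print(transform(("fork", "a", "b", "x", "y")))
print(transform(("pop2", "h", "t", "a", "b", "c")))
```

Reactions such as match need a second Fraglet from the store and are therefore not covered by this single-Fraglet function.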



First of all, we define five additional symbols A, B, C, L, and R. These symbols do not
react on their own behalf, i. e., [A : tail] → [A : tail]. After a bootstrap reaction, each
node in the network will contain a Fraglet of the form [L : {id}] containing the identifier id
of the node that it thinks has won/currently leads in the election. It broadcasts this
information to its neighbors, which will receive it in the form of the Fraglet [R : {id}]. A, B, and
C have no further meaning. The election algorithm is constituted by six Fraglets: [node : L],
which creates the first L-type Fraglet at bootstrap; [matchP : L : fork : L : A] and
[matchP : A : broadcast : R], which are used to transmit the highest node ID currently
known in an R-type Fraglet; [matchP : R : match : L : lt : B : C], which initiates a
comparison with incoming Fraglets of that type; and [matchP : B : pop2 : null : L] and
[matchP : C : pop2 : L : null], which evaluate the outcome of this comparison and produce
a new L-type Fraglet. Figure 4.32 shows the flow of reactions in this system, which will
finally lead to all nodes knowing the highest node ID in the network.
[Figure 4.32, diagram not reproduced: the flow of reactions in the Fraglet store of one node,
from the bootstrap Fraglet [node : L] over the broadcast of R-type Fraglets to the network and
the lt comparison of an incoming [R : {id2}] with the local [L : {id1}], to the pop2 reactions
that produce the new L-type Fraglet.]

Figure 4.32: A Fraglet-based election algorithm



Definition 4.6 (Quine). A quine 71 is a computer program which produces a copy of itself
(or its source code) as output.

From Kleene's second recursion theorem 72 [1148], it follows that quines can be defined in
each Turing-complete language. Yamamoto et al. [2276, 1399] have introduced quine Fraglets
like the one in Figure 4.33 as a vehicle for self-replicating and self-modifying programs.

Fraglets as a program representation are predestined for evolutionary protocol synthesis. 
Indeed, they have a low positional epistasis (see Section 4.8.1 on page 202), since the order 
of the Fraglets in the Fraglet store plays no role. The order of the single commands inside a 
Fraglet, however, is significant. In Section 23.2.2 on page 404, we discuss the application of 

71 http://en.wikipedia.org/wiki/Quine_%28computing%29 [accessed 2008-05-04]

72 http://en.wikipedia.org/wiki/Kleene%27s_recursion_theorem [accessed 2008-05-04]
[Figure 4.33, diagram not reproduced: a reaction cycle in which [match : x : fork : nop : x]
and [x : match : x : fork : nop : x] react to [fork : nop : x : match : x : fork : nop : x],
which forks into [nop : match : x : fork : nop : x] and [x : match : x : fork : nop : x].]

Figure 4.33: A simple quine Fraglet (borrowed from [2276])



Fraglets for protocol evolution based on the work by Tschudin [2058] and in Section 23.2.2 on 
page 404, the joint work of Yamamoto and Tschudin [2275] on online adaptation of Fraglet 
protocols is outlined. 
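
The reaction cycle of the quine Fraglet in Figure 4.33 can be replayed with a small sketch. It is an illustration under assumptions, not the Fraglet reference implementation: the store is modelled as a `collections.Counter` (a multiset) of tuples, only the rules nop, fork, and match from Table 4.2 are implemented, and the helper names `_take` and `step` are hypothetical.

```python
from collections import Counter

def _take(store, fraglet):
    """Remove one copy of a Fraglet from the multiset."""
    store[fraglet] -= 1
    if store[fraglet] == 0:
        del store[fraglet]

def step(store):
    """Perform the first applicable nop, fork, or match rule.
    Returns True if the store changed and False if nothing can react."""
    for f in list(store):
        head, rest = f[0], f[1:]
        if head == "nop":
            produced = [rest]
        elif head == "fork" and len(rest) >= 2:
            produced = [(rest[0],) + rest[2:], (rest[1],) + rest[2:]]
        elif head == "match" and rest:
            # look for a partner Fraglet whose head equals the awaited symbol
            partner = next((g for g in store if g[0] == rest[0] and g != f), None)
            if partner is None:
                continue
            _take(store, partner)
            produced = [rest[1:] + partner[1:]]   # tails are concatenated
        else:
            continue
        _take(store, f)
        store.update(p for p in produced if p)    # drop empty Fraglets
        return True
    return False

quine = Counter([("match", "x", "fork", "nop", "x"),
                 ("x", "match", "x", "fork", "nop", "x")])
store = quine.copy()
for _ in range(3):           # match, then fork, then nop
    step(store)
print(store == quine)        # the two Fraglets have regenerated themselves
```

After one match, one fork, and one nop step, the store again holds exactly the two Fraglets it started with, which is the self-reproducing behavior the figure depicts.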



4.10 Problems Inherent in the Evolution of Algorithms 

Genetic Programming can be utilized to breed programs or algorithms suitable
for a given problem class. In order to guide such an evolutionary process, the programs
bred have to be evaluated. They are assessed in terms of functional and non-functional
requirements. The functional properties comprise all features regarding how well the
algorithm solves the specified problem and the non-functional aspects are concerned with, for
example, its size and memory consumption. Normally, a set $F = \{f_1, \ldots, f_n\}$ of objective
functions is specified in order to map these attributes to the subsets $Y_1, \ldots, Y_n$ of the real
numbers $\mathbb{R}$.



4.10.1 Correctness of the Evolved Algorithms 
Introduction 

Genetic Programming can be utilized to breed programs or algorithms suitable for a given
problem class. In order to guide such an evolutionary process, the synthesized programs
have to be evaluated, i. e., assessed in terms of functional and non-functional requirements.
The functional properties comprise all features regarding how well a program solves the
specified problem and the non-functional aspects are concerned with, for example, its size
and memory consumption. Often, a set $F = \{f_1, \ldots, f_n\}$ of objective functions is specified in
order to map these attributes to the real numbers.



The Problems 

In Genetic Programming, some of the non-functional objective values such as the size of
the evolved programs can easily be computed. Determining their functional utility,
however, cannot be achieved by any arithmetically closed function or algorithm (at least if a
Turing-complete representation is chosen), since this would be an instance of the
Entscheidungsproblem 73 [496] as well as of the Halting Problem 74 [1894].

73 http://en.wikipedia.org/wiki/Entscheidungsproblem [accessed 2007-07-03] 

74 http://en.wikipedia.org/wiki/Halting_problem [accessed 2007-07-03] 
Entscheidungsproblem

The Entscheidungsproblem, formulated by David Hilbert [926, 927], asks for an algorithm 
that, if provided with a description of a formal language and a statement in that language, 
can decide whether or not the statement holds [2219]. In the case of Genetic Programming, 
the formal language is the language in which the programs are evolved, i. e., the problem 
space, and the statements are the programs themselves. Church [407, 408] and Turing [2065, 
2066] both proved that an algorithm solving the Entscheidungsproblem cannot exist. 

No Exhaustive Testing 

It is not possible to use some kind of algorithm in order to determine whether the evolved
programs will provide correct results. Thus, training cases, (simulation) scenarios in which 
a program is executed test-wise, must be used to find out whether it is suitable for the 
given problem. Software Testing is a very important field in software engineering [168, 664, 
784, 1090] . The core problem of testing programs for their functionality and performance is 
the size of the input space. Assume that we pursued the evolution of a program that takes
two integer numbers (32 bit) as input and computes another one as output. For testing
this program with all possible inputs, $2^{32} \cdot 2^{32} = 2^{64} = 18\,446\,744\,073\,709\,551\,616$ single
tests would be required. Even if each test run took only 1 µs, exhaustive testing would take
approximately 584 542 years. In most practical applications, the input space is much larger.
Exhaustive testing of the evolved algorithms is thus not feasible in virtually all cases.
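
The estimate above can be reproduced with a few lines of Python. The sketch assumes a year of 365.25 days, which is the convention that matches the figure quoted in the text; the variable names are chosen for illustration only.

```python
tests = 2**32 * 2**32                     # every pair of 32-bit inputs
seconds = tests * 1e-6                    # assuming 1 microsecond per test run
years = seconds / (365.25 * 24 * 3600)    # assuming 365.25 days per year

print(tests)          # 18446744073709551616
print(round(years))   # 584542
```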

Instead, we can only pick a very small fraction of the possible test scenarios for training 
and hope that they will provide significant results. The probability that this will happen 
depends very much on the method with which the training cases are selected. 

Most often, one cannot be sure whether evolved behavioral patterns (or algorithms) are 
perfect and free from errors in all possible situations. Here, nature indeed has the same 
problem as the noble-minded scientists who apply Genetic Programming, as the following 
small analogy 75 will show. 

The Monkey and the Orange Consider a certain scheme in the behavior of monkeys. If a 
monkey sees or smells something delicious in, for example, a crevice, it sticks its hand in, 
grabs the item of desire and pulls it out. This simple behavior itself is quite optimal and has 
served the monkeys well for generations. With the occurrence of homo sapiens, the situation 
changed. African hunters still use this behavior against the monkeys by creating a situation 
that was never relevant during its "evolutionary testing period" : They slice a coconut in half 
and put a hole in one side just big enough for a monkey's hand to fit through. Now they 
place an orange between the two coconut halves, tie them closely together and secure the 
trap with a rope to a tree. Sooner or later, a monkey will smell the orange, find the coconut 
with the hole, stick its hand inside and grab the fruit. However, with the orange in its fist, it 
cannot pull the hand out anymore. The hunter can now easily catch the monkey, to whom 
it never occurs that it could let go of the fruit and save its life.

In other words, although evolutionary algorithms like Genetic Programming may provide 
good solutions for many problems, their results still need to be analyzed and interpreted by 
somebody at least a bit more cunning than an average monkey. 

This implies that the solutions need to be delivered in a human-readable way. In some 
optimization problems, we may, however, choose a representation for the solution candidates 
which is very hard to understand for human beings but more suitable for evolution. Hence, 
it is not always possible to perform a sanity check on the evolved programs by hand. 

75 It is hard to find references confirming this story. It occurred in one scene of the movie Animals
are beautiful people by Uys [2085], roughly resembles one of Aesop's fables [14], and is
mentioned on the Wikipedia [2219] page on coconuts from 2007-07-21 (http://en.wikipedia.org/
w/index.php?title=Coconut&oldid=146038518) where they had the same problem and
eventually removed the corresponding text. But regardless of whether it is just an urban legend
or not, it is a nice story.
Halting Problem

The Halting Problem is basically an instance of the Entscheidungsproblem and asks for an
algorithm that decides whether another algorithm, if provided with a certain, finite input,
will terminate at some point in time or run forever. Again, Turing [2065, 2066] proved that
a general algorithm solving the Halting Problem cannot exist. One possible way
to show this is to use a simple counter-example: Assume that a correct algorithm doesHalt
exists (as presumed in Algorithm 4.2) which takes a program algo as input and determines
whether it will terminate or not. It is now possible to specify a program trouble which, in
turn, uses doesHalt to determine if it will halt at some point in time. If doesHalt returns
true, trouble loops forever. Otherwise it halts immediately. In other words, doesHalt cannot
return the correct result for trouble and hence, cannot be applied universally. Thus, it is
not possible to solve the Halting Problem algorithmically for programs in a Turing-complete
representation. One consequence of this fact is that there are no means
to determine when an evolved program will terminate or whether it will do so at all (if
its representation allows infinite execution, that is) [2011, 2254]. Langdon and Poli [1243]
have shown that in Turing-complete linear Genetic Programming systems, most synthesized
programs loop forever and the fraction of halting programs of size length is proportional to
$1/\sqrt{\text{length}}$, i. e., small.



Algorithm 4.2: Halting Problem: reductio ad absurdum

begin
  doesHalt(algo) ∈ {true, false}
  begin
    ...
  end

  Subalgorithm trouble()
  begin
    if doesHalt(trouble) then
      while true do
        ...
  end
end



Countermeasures 

Against the Entscheidungsproblem 

For general, Turing-complete program representations, neither exhaustive testing nor algo- 
rithmic detection of correctness is possible. 

Model Checking Model checking 76 techniques [413, 1483] have made great advances since
the 1980s. According to Clarke and Emerson [412], "Model checking is an automated
technique that, given a finite-state model of a system and a logical property, systematically checks
whether this property holds for (a given initial state in) that model." The result of the
checking process is either a confirmation of the correctness of the checked model, a counterexample
in which it fails to obey its specification, or failure, i. e., a situation in which no conclusion
could be reached.

Hence, in the context of Genetic Programming, a model checker can be utilized as a
Boolean function $\varphi : X \mapsto \mathbb{B}$ which maps the evolved programs to correct (= true) or
incorrect (= false). As objective function, $\varphi$ therefore is rather infeasible, since it would
lead directly to the all-or-nothing problem discussed in Section 4.10.2. 77

76 http://en.wikipedia.org/wiki/Model_checking [accessed 2008-10-02]

Still, model checkers can be an interesting way to define termination criteria for the
evolution or to verify its results. This may require a reduction of the expressiveness of the
GP approaches utilized in order to make them compliant with the input languages of the
model checkers. Then again, there are very powerful model checkers such as SPIN 78 [955,
176, 256], which processes systems written in Promela 79 (the Process Meta Language),
with which asynchronous distributed algorithms can be specified [175]. If such a system were
used, no reduction of the expressiveness of the program representation would be needed at
all. Nevertheless, a formal transformation of the GP representation to these languages must
be provided in any circumstance. Creating such a transformation is complicated and requires
a formal proof of correctness: checking a model without having shown the correctness of
the model representation first is, basically, nonsense. 80

The idea of using model checkers like SPIN is very tempting. One important drawback
of this method is the unforeseeable runtime of the checking process, which ranges from an almost
instantaneous return up to almost half an hour [2031]. In the same series of experiments
([2031]), the checking process also failed in a fraction of cases (≈ 18%) depending on the
problem to be verified. Especially the unpredictable runtime for general problems led us to
the decision not to use SPIN in our own works yet, since in the worst case, a few thousand
program verifications could be required per generation in the GP system. Still, it is an
interesting idea to evolve programs in the Promela language and we will reconsider it in our
future work and evaluate the utility and applicability of SPIN for the said purposes in
detail.

Functional Adequacy In the face of this situation where we cannot automatically determine
whether an evolved algorithm is correct, overfitted, or oversimplified, a notion of which
solutions are acceptable and which are not is required. One definition which fits perfectly in
this context is the idea of functional adequacy provided by Camps et al. [327] and Gleizes et al.
[809]:

Definition 4.7 (Functional Adequacy). When a system has the "right" behavior - 
judged by an external observer knowing the environment - we say that it is functionally 
adequate [809]. 

In the context of Genetic Programming, the external observer is represented by the 
objective functions which evaluate the behavior of the programs in the simulation environ- 
ments. According to Gleizes et al. [809], functional adequacy also subsumes non-functional 
criteria such as memory consumption or response time if they become crucial in a certain en- 
vironment, i.e., influence the functionality. For optimizing such criteria, different additional 
approaches are provided in Section 4.10.3. 

Against The Halting Problem 

In order to circumvent the Halting Problem, the evolved programs can be executed in sim- 
ulations which allow limiting their runtime [2254, 1027]. Programs which have not finished 
until the time limit has elapsed are terminated automatically. Especially in linear Genetic 
Programming approaches, it is easy to do so by simply defining an upper bound for the 
number of instructions being executed. For tree-based representations, this is slightly more 
complicated. 
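
For a linear program representation, such an instruction budget can be sketched as follows. The toy register machine, its three instructions (`inc`, `dec`, `jnz`), and the default limit of 1000 steps are assumptions made for this illustration and do not stem from the cited works.

```python
def run(program, registers, max_steps=1000):
    """Execute a linear program for at most max_steps instructions.

    program   -- list of tuples such as ("inc", r), ("dec", r), ("jnz", r, target)
    registers -- initial register values
    Returns (registers, finished); finished is False if the step budget was
    exhausted before the program counter left the program.
    """
    regs = list(registers)
    pc = steps = 0
    while 0 <= pc < len(program):
        if steps >= max_steps:
            return regs, False          # terminated by the time limit
        op, *args = program[pc]
        steps += 1
        if op == "inc":
            regs[args[0]] += 1
        elif op == "dec":
            regs[args[0]] -= 1
        elif op == "jnz" and regs[args[0]] != 0:
            pc = args[1]                # jump if the register is not zero
            continue
        pc += 1                         # unknown instructions act as no-ops
    return regs, True

countdown = [("dec", 0), ("jnz", 0, 0)]     # halts once register 0 reaches zero
endless = [("inc", 0), ("jnz", 0, 0)]       # never halts on its own

print(run(countdown, [3]))       # ([0], True)
print(run(endless, [0])[1])      # False
```

An objective function could then treat a run with `finished == False` as a failed training case instead of waiting for the program indefinitely.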

Teller [2011] suggests applying time-limiting approaches too, but also the use of so-called
anytime algorithms, i. e., algorithms that store their best guess of the result in a certain

77 One approach to circumvent this problem would be to check for several properties separately. 

78 http://en.wikipedia.org/wiki/SPIN_model_checker [accessed 2008-10-02] 

79 http://en.wikipedia.org/wiki/Promela [accessed 2008-10-02] 

80 Thanks to Hendrik Skubch for discussing this issue with me. 
memory cell and update it during their run. Anytime algorithms can be stopped at any
time, since the result is always there, although it would have been refined in the future 
course of the algorithm. 
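
The anytime principle can be illustrated with a deliberately simple, hand-written example (not an evolved program and not taken from Teller's work): a bisection that refines an estimate of a square root and writes every intermediate guess into a one-element list serving as the memory cell. The names `anytime_sqrt` and `cell` are hypothetical.

```python
def anytime_sqrt(x, cell):
    """Refine cell[0] towards sqrt(x) for x >= 0 by bisection.

    Written as a generator so that the caller decides after every step
    whether to continue; cell[0] always holds the current best guess.
    """
    lo, hi = 0.0, max(1.0, x)
    while True:
        mid = (lo + hi) / 2
        cell[0] = mid              # the best guess is available at any time
        if mid * mid > x:
            hi = mid
        else:
            lo = mid
        yield

cell = [None]
algorithm = anytime_sqrt(2.0, cell)
for _ in range(50):                # the "time limit": 50 refinement steps
    next(algorithm)
print(cell[0])                     # an approximation of the square root of 2
```

Stopping the loop earlier simply yields a coarser value in `cell[0]`; no run ever ends without a result.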

Another way to deal with this problem is to prohibit the evolution of infinite loops or 
recursions from the start by restricting the structural elements in the programming language. 
If there are no loops, there surely cannot be infinite ones either. Imposing such limitations, 
however, also restricts the programs that can evolve: A representation which does not allow 
infinite loops cannot be Turing-complete either. 




Figure 4.34: A sketch of an infinite message loop. 



Often it is not sufficient to restrict just the programming language. An interesting ex- 
ample for this issue is the evolution of distributed algorithms. Here, the possible network 
situations and the reactions to them would also need to be limited. One would need to 
exclude situations like the one illustrated in Figure 4.34 where 

1. node A sends message X to node B which 

2. triggers an action there, leading to a response message Y from B back to node A which, 
in turn, 

3. causes an action on A that includes sending X to B again 

4. and so on. . . 

Preventing such a situation is even more complicated and will, most likely, also prevent the 
evolution of useful solutions. 

4.10.2 All-Or-Nothing? 

The evolution of algorithms often proves to be a special instance of the needle-in-a-haystack
problem. From a naive and, at the same time, mathematically precise point of view, an 
algorithm computing the greatest common divisor of two numbers, for instance, is either 
correct or wrong. Approaching this problem straightforwardly leads to the application of a 
single objective function which can take on only two values, provoking the all-or-nothing 
problem in Genetic Programming. In such a fitness landscape, a few steep spikes of equal 
height represent the correct algorithms and are distributed over a large plane of infeasible 
solution candidates with equally bad fitness. 

The negative influence of all-or-nothing problems has been reported from many areas
of Genetic Programming, such as the evolution of distributed protocols [2058] (see Sec- 
tion 23.2.2), quantum algorithms [1932], expression parsers [1027], and mathematical algo- 
rithms (such as the GCD). 

In Section 21.3.2, we show some means to mitigate this problem for the GCD
evolution. However, like those mentioned in some of the previously cited works, such methods 
are normally application dependent and often cannot be transferred to other problems in a 
simple manner. 

Countermeasures 

There are two direct countermeasures against the all-or-nothing problem in GP. The first 
one is to devise objective functions which can take on as many values as possible, i.e., which 
also reward partial solutions. 
The second countermeasure is using as many test cases as possible and applying the
objective functions to all of them, setting the final objective values to be the average of the 
results. Testing with ten training cases will transform a binary objective function to one 
which (theoretically) can take on eleven values, for instance: 1.0 if all training cases were 
processed correctly, 0.9 if one training case failed while nine worked out properly, . . . , and 
0.0 if the evolved algorithm was unable to behave adequately in any of the training cases. 
Using multiple training cases has, of course, the drawback that the time needed for the 
objective function evaluation will increase (linearly). 
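
Averaging a binary objective over several training cases can be sketched for the GCD example as follows. The ten training cases, the function name `fraction_correct`, and the deliberately flawed candidate are assumptions chosen for illustration; Python's `math.gcd` serves as the reference.

```python
from math import gcd

def fraction_correct(candidate, training_cases):
    """Average of a binary per-case objective: the fraction of training
    cases (a, b) for which candidate(a, b) equals gcd(a, b)."""
    passed = 0
    for a, b in training_cases:
        try:
            if candidate(a, b) == gcd(a, b):
                passed += 1
        except Exception:          # a crashing program fails the case
            pass
    return passed / len(training_cases)

cases = [(12, 18), (7, 5), (9, 3), (10, 4), (8, 8),
         (21, 14), (100, 75), (17, 34), (6, 35), (27, 36)]

flawed = lambda a, b: min(a, b)    # correct only if the smaller number divides the larger

print(fraction_correct(gcd, cases))      # 1.0
print(fraction_correct(flawed, cases))   # 0.3
```

The flawed candidate is no longer indistinguishable from a completely useless one: it receives partial credit for the three cases it happens to solve.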

Vaguely related to these two measures is another approach, the utilization of Lamarck- 
ian evolution [522, 2215] or the Baldwin effect [123, 929, 930, 2215] (see Section 15.2 and 
Section 15.3, respectively). As already pointed out in Section 1.4.3, they incorporate a local 
search into the optimization process which may further help to smoothen out the fitness 
landscape [864]. 

In our experiments reported in [2177], an approach similar to Lamarckian evolution was
incorporated. Although it provided good results, the runtime of the approach increased to
a degree rendering it infeasible for large-scale application. 81

4.10.3 Non-Functional Features of Algorithms

Besides evaluating an algorithm in terms of its functionality, there always exists a set of
non-functional features that should be regarded too. For most non-functional aspects (such as
code size, runtime requirements, and memory consumption), the parsimony 82 principle
holds: less is better. In this section, we will discuss various reasons for applying parsimony
pressure in Genetic Programming.

Code Size 

In Section 30.1.1 on page 547, we define what algorithms are: compositions of atomic
instructions that, if executed, solve some kind of problem or a class of problems. Without
specifying more closely what atomic instructions are, we can define the following:

Definition 4.8 (Code Size). The code size of an algorithm or program is the number of 
atomic instructions it is composed of. 

The atomic instructions cannot be broken down into smaller pieces. Therefore, the code
size is a non-negative integer number in $\mathbb{N}_0$. Since algorithms are statically finite per definition
(see Definition 30.9 on page 550), the code size is always finite.
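
For tree-based representations, the code size of Definition 4.8 can be computed by counting nodes. The sketch below assumes that every node of an expression tree counts as one atomic instruction and that trees are encoded as nested tuples of the form (operator, child, child, ...); both are conventions of this illustration only.

```python
def code_size(tree):
    """Number of nodes in an expression tree given as nested tuples
    (operator, child, ...); any non-tuple value is a leaf."""
    if isinstance(tree, tuple):
        return 1 + sum(code_size(child) for child in tree[1:])
    return 1

expression = ("+", 4, ("*", 0, ("-", 4, "x")))   # encodes 4 + 0 * (4 - x)
print(code_size(expression))                     # 7
```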

Code Bloat 

Definition 4.9 (Bloat). In Genetic Programming, bloat is the uncontrolled growth in size 
of the individuals during the course of the evolution [1318, 229, 140, 1196, 1241]. 

The term code bloat is often used in conjunction with code introns, which are regions 
inside programs that do not contribute to the functional objective values (because they 
can never be reached, for instance; see Definition 3.2 on page 146). Limiting the code size 
and increasing the code efficiency by reducing the number of introns is an important task 
in Genetic Programming since disproportionate program growth has many bad side effects 
like: 

1. The evolving programs become unnecessarily big while elegant solutions should always 
be as small and simple as possible. 

81 These issues were not the subject of the paper and thus, not discussed there. 

82 http://en.wikipedia.org/wiki/Parsimony [accessed 2008-10-14] 
2. Mutation and recombination operators always have to select the point in an individual
where they will apply their changes. If there are many points that do not contribute
to functionality, the probability of selecting such a point for modification is high. The
generated offspring will then have exactly the same functionality as its parents and the
genetic operation performed was literally useless.

3. Bloat slows down both the evaluation [872] and the breeding process of new solution
candidates.

4. Furthermore, it leads to increased memory consumption of the Genetic Programming 
system. 

There are many theories about how code bloat emerges [1318], some of them are: 

1. Unnecessary code hitchhikes with good individuals. If it is part of a fit solution candidate 
that creates many offspring, it is likely to be part of many new individuals. According 
to Tackett [1994], high selection pressure is thus likely to cause code growth. This idea 
is supported by the research of Langdon and Poli [1241], Smith and Harries [1906], and 
Gustafson et al. [872]. 

2. As already stated, unnecessary code makes it harder for genetic operations to alter the 
functionality of an individual. In most cases, genetic operators yield offspring with worse 
fitness than its parents. If a solution candidate has good objective values, unnecessary 
code can be one defense method against recombination and mutation. If the genetic 
operators are neutralized, the offspring will have the same fitness as its parent. This idea 
has been suggested in many sources, such as [229, 228, 1544, 1384, 1756, 140, 1244, 1906]. 
From this point of view, introns are a "bad" form of neutrality 83. By the way, the
reduction of the destructive effect of recombination on the fitness may also have positive 
effects, as pointed out by Nordin et al. [1546, 1547], since it may lead to a more durable 
evolution. 

3. Luke [1318] defines a theory for tree growth based on the fact that recombination is 
likely to destroy the functionality of an individual. However, the deeper the crossover 
point is located in the tree, the smaller is its influence because fewer instructions are 
removed. If only a few instructions are replaced from a functionally adequate program, 
they are likely to be exchanged by a larger sub-tree. A new offspring that retains the 
functionality of its parents therefore tends to be larger. 

4. Similar to the last two theories, the idea of removal bias by Soule and Foster [1922]
states that removing code from an individual will preserve the individual's functionality 
if the code removed is non-functional. Since the portion of useless code inside a program 
is finite, there also exists an upper limit of the amount of code that can be removed 
without altering the functionality of the program. For the size of new sub-trees that 
could be inserted instead (due to mutation or crossover), no such limit exists. Therefore, 
programs tend to grow [1922, 1244]. 

5. According to the diffusion theory of Langdon et al. [1244], the number of large pro- 
grams in the problem space that are functionally adequate is higher than the number 
of small adequate programs. Thus, code bloat could correspond to the movement of the 
population into the direction of equilibrium [1318]. 

6. Another theory considers the invalidators that make code unreachable or dysfunctional.
In the formula 4 + 0 * (4 − x), for example, the multiplication with 0 makes the whole part
(4 − x) inviable. Luke [1318] argues that the influence of invalidators would be higher
in large trees than in small trees. If programs grow while the fraction of invalidators 
remains constant and those inherited from the parents stay in place, their chance to occur 
proportionally closer to the root increases. Then, the amount of unnecessary instructions 
would increase too and naturally approach 100%. 

7. Instead of being real solutions, programs that grow uncontrolled also tend to be some
sort of decision tables. This phenomenon is called overfitting and has already been discussed


83 You can find the topic of neutrality discussed in Section 1.4.5 on page 64.
in Section 1.4.8 on page 72 and Section 23.1.3 on page 399 in detail. The problem is that
overfitted programs tend to have marvelously good fitness for the training cases/sample
data, but are normally useless for any other input.
8. Like Tackett [1994], Gustafson et al. [872] link code growth to high selection pressure 
but also to loss of diversity in general. In populations with less diversity, recombination 
will frequently be applied to very similar individuals, which often yields slightly larger 
offspring. 

Some approaches for fighting bloat are discussed in Section 4.10.3.

Runtime and Memory Consumption

Another aspect subject to minimization is generally the runtime of the algorithms grown.
The number of steps needed to solve a given task, i. e., the time complexity, is only loosely
related to the code size. Although large programs with many instructions tend to run longer
than small programs with few instructions, the existence of loops and recursion invalidates
a direct relation.

Like the complexity in time, the complexity in memory space of the evolved solutions
often is minimized, too. The number of variables and memory cells needed by a program in
order to perform its work should be as small as possible. Section 30.1.3 on page 550 provides
some additional definitions and discussion about the complexity of algorithms.

Errors 

An example for an application where the non-functional errors that can occur should be
minimized is symbolic regression. For this purpose, the property of closure specified in
Definition 4.1 on page 178 is usually ensured: the division operator div, for instance, is
redefined in order to prevent division-by-zero errors. Such a division could either be rendered
a nop (i. e., do nothing) or yield 1 or the dividend as result. However, the number of such
arithmetical errors could also be counted and made subject to minimization too.
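
A redefined division operator of this kind, combined with an error counter that could feed an additional objective function, might look like the following sketch. The class name `ProtectedDiv` and its `fallback` parameter are hypothetical and only illustrate the two options named above (yielding 1 or the dividend).

```python
class ProtectedDiv:
    """Division that never raises: a division by zero yields either 1
    (fallback="one") or the dividend (fallback="dividend") and is counted."""

    def __init__(self, fallback="one"):
        self.fallback = fallback
        self.errors = 0            # number of divisions by zero seen so far

    def __call__(self, a, b):
        if b == 0:
            self.errors += 1
            return 1.0 if self.fallback == "one" else a
        return a / b

div = ProtectedDiv()
print(div(6, 3))       # 2.0
print(div(5, 0))       # 1.0
print(div.errors)      # 1
```

The counter `errors` is the quantity that could be minimized alongside the approximation error of the evolved formula.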

Transmission Count 

When evolving distributed algorithms, the number of messages required to solve a problem
should be as low as possible since transmissions are especially costly and time-consuming 
operations. 

Optimizing Non-Functional Aspects 

Optimizing the non-functional aspects of the individuals evolved is a topic of scientific in- 
terest. 

1. One of the simplest means of doing so is to define additional objective functions which 
minimize the program size and to perform a multi-objective optimization. Successful 
and promising experiments by Bleuler et al. [227], de Jong et al. [510], and Ekart and 
Nemeth [626] showed that this is a viable countermeasure for code bloat, for instance. 

2. Another method is limiting the aspect of choice. A very simple measure to limit code 
bloat, for example, is to prohibit the evolution of trees with a depth surpassing a certain 
limit [1320]. 

3. Poli [1660] furthermore suggests that the fitness of a certain portion of the population 
with above-average code size should simply be set to the worst possible value. These 
artificial fitness holes will repel the individuals from becoming too large and hence, 
reduce the code bloat. 
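
The third approach can be illustrated with a small sketch. It assumes that "a certain portion" is realized by penalizing each above-average individual with a fixed probability; the function name, the `portion` parameter and its default value are choices made for this example only and are not taken from [1660].

```python
import random


def apply_fitness_holes(sizes, fitness, worst, portion=0.3, rng=random):
    """Return a copy of `fitness` in which some oversized individuals get `worst`.

    sizes[i] is the code size of individual i and fitness[i] its fitness.
    An individual larger than the average size is penalized with
    probability `portion` (an assumed parameter).
    """
    mean_size = sum(sizes) / len(sizes)
    penalized = list(fitness)
    for i, size in enumerate(sizes):
        if size > mean_size and rng.random() < portion:
            penalized[i] = worst  # artificial fitness hole
    return penalized
```

With `portion=1.0` every individual of above-average size falls into a fitness hole, and with `portion=0.0` the fitness values stay untouched.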
Evolution Strategy



5.1 Introduction 

Evolution Strategies¹ (ES), introduced by Rechenberg [1712, 1713, 1714], are a heuristic
optimization technique based on the ideas of adaptation and evolution and a special form of
evolutionary algorithms [1712, 1713, 1714, 103, 200, 1841, 198, 916]. Evolution Strategies
have the following features:

1. They usually use vectors of real numbers as solution candidates, i.e., $G = X = \mathbb{R}^n$.
In other words, both the search space and the problem space consist of fixed-length strings
of floating point numbers, similar to the real-encoded genetic algorithms mentioned in
Section 3.3 on page 145.

2. Mutation and selection are the primary operators; recombination is less common.

3. Mutation most often changes the elements $x[i]$ of the solution candidate vector $x$
to a number drawn from a normal distribution $N(x[i], \sigma_i^2)$. For reference, see
Equation 11.1 on page 259 in the text about Random Optimization.

4. The values $\sigma_i$ are governed by self-adaptation [891, 1400, 1214] such as covariance
matrix adaptation [888, 889, 890, 1041].

5. In all other aspects, they perform exactly like basic evolutionary algorithms as defined
in Algorithm 2.1 on page 99.
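
The mutation described in item 3 can be written down directly. The following Python sketch assumes one standard deviation per vector element; the function name `mutate` and the optional `rng` argument are illustrative and not part of the text.

```python
import random


def mutate(x, sigma, rng=random):
    """Return a new vector whose i-th element is drawn from N(x[i], sigma[i]^2).

    x and sigma are sequences of equal length; sigma[i] is the standard
    deviation used for element i.
    """
    return [rng.gauss(xi, si) for xi, si in zip(x, sigma)]
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, and setting all standard deviations to zero returns the parent unchanged.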



5.2 General Information 
5.2.1 Areas Of Application 

Some example areas of application of Evolution Strategy are: 

Application                                                   References

Data Mining and Data Analysis                                 [445]
Scheduling                                                    [971]
Chemistry, Chemical Engineering                               [1755, 470, 632]
Resource Minimization, Environment Surveillance/Protection    [555]
Combinatorial Optimization                                    [1536, 193, 197]
Geometry and Physics                                          [1122, 2173]
Optics and Image Processing                                   [859, 860, 101, 2218, 2217, 1279]



¹ http://en.wikipedia.org/wiki/Evolution_strategy [accessed 2007-07-03],
http://www.scholarpedia.org/article/Evolution_Strategies [accessed 2007-07-03]

5.2.2 Conferences, Workshops, etc.

Some conferences and workshops on Evolution Strategy are:

EUROGEN: Evolutionary Methods for Design Optimization and Control with Applications
to Industrial Problems
see Section 2.2.2 on page 106



5.2.3 Books 

Some books about (or including significant information about) Evolution Strategy are: 

Schwefel [1841]: Evolution and Optimum Seeking: The Sixth Generation

Rechenberg [1713]: Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien
der biologischen Evolution

Rechenberg [1714]: Evolutionsstrategie '94

Beyer [198]: The Theory of Evolution Strategies

Schwefel [1840]: Numerical Optimization of Computer Models

Schoneburg, Heinzmann, and Feddersen [1831]: Genetische Algorithmen und Evolution-

Back [99]: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary
Programming, Genetic Algorithms



5.3 Populations in Evolution Strategy 

Evolution Strategies usually combine truncation selection (as introduced in Section 2.4.2 on
page 122) with one of the following population strategies, which have partly been borrowed
from the German Wikipedia [2219] article on Evolution Strategy².

5.3.1 (1 + 1)-ES 

The population only consists of a single individual which is reproduced. Of the parent and
its offspring, the better individual survives and forms the next population. This scheme
is very close to hill climbing, which will be introduced in Chapter 10 on page 253.

5.3.2 $(\mu + 1)$-ES

Here, the population contains $\mu$ individuals, from which one is drawn randomly. This
individual is reproduced. From the joint set of its offspring and the current population, the
least fit individual is removed.

5.3.3 $(\mu + \lambda)$-ES

Using the reproduction operations, $\lambda > \mu$ offspring are created from $\mu$ parent
individuals. From the joint set of offspring and parents, only the $\mu$ fittest ones are
kept [936].

² http://de.wikipedia.org/wiki/Evolutionsstrategie [accessed 2007-07-03]

5.3.4 $(\mu, \lambda)$-ES

In $(\mu, \lambda)$ Evolution Strategies, introduced by Schwefel [1840], again $\lambda > \mu$
children are created from $\mu$ parents. The parents are subsequently deleted, and of the
$\lambda$ offspring individuals, only the $\mu$ fittest are retained [1840, 196].
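
The difference between the plus and the comma strategy lies only in the pool from which the survivors are taken. A minimal Python sketch, assuming minimization and using illustrative function names:

```python
def truncation(pool, fitness, mu):
    """Keep the mu fittest individuals of pool (smaller fitness is better)."""
    return sorted(pool, key=fitness)[:mu]


def plus_selection(parents, offspring, fitness, mu):
    """(mu + lambda): survivors are chosen from parents and offspring together."""
    return truncation(parents + offspring, fitness, mu)


def comma_selection(parents, offspring, fitness, mu):
    """(mu, lambda): the parents are discarded, survivors come from the offspring only."""
    return truncation(offspring, fitness, mu)
```

With parents `[1.0, 5.0]`, offspring `[3.0, 4.0, 6.0]`, `fitness=abs` and `mu=2`, the plus strategy keeps the good parent `1.0`, whereas the comma strategy loses it.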

5.3.5 $(\mu/\rho, \lambda)$-ES

Evolution Strategies named $(\mu/\rho, \lambda)$ are basically $(\mu, \lambda)$ strategies. The
additional parameter $\rho$ denotes the number of parent individuals of one offspring. As
already said, normally, we only use mutation ($\rho = 1$). If recombination is also used as in
other evolutionary algorithms, $\rho = 2$ holds. A special case of $(\mu/\rho, \lambda)$
algorithms is the $(\mu/\mu, \lambda)$ Evolution Strategy [1369].

5.3.6 $(\mu/\rho + \lambda)$-ES

Analogously to $(\mu/\rho, \lambda)$-Evolution Strategies, the $(\mu/\rho + \lambda)$-Evolution
Strategies are $(\mu + \lambda)$ approaches where $\rho$ denotes the number of parents of an
offspring individual.

5.3.7 $(\mu', \lambda'(\mu, \lambda)^{\gamma})$-ES

Geyer et al. [791, 792, 793] have developed nested Evolution Strategies where $\lambda'$ offspring
are created from a population of the size $\mu'$ and isolated for $\gamma$ generations. In each
of the $\gamma$ generations, $\lambda$ children are created, from which the fittest $\mu$ are
passed on to the next generation. After the $\gamma$ generations, the best individuals from each
of the isolated solution candidates are propagated back to the top-level population, i.e.,
selected. Then, the cycle starts again with $\lambda'$ new child individuals. This nested
Evolution Strategy can be more efficient than the other approaches when applied to complex
multimodal fitness environments [1714, 793].

5.4 One-Fifth Rule 

The 1/5 success rule defined by Rechenberg [1713] states that the quotient of the number of
successful mutations (i.e., those which lead to fitness improvements) to the total number
of mutations should be approximately 1/5. If the quotient is bigger, the $\sigma$-values should
be increased and, with them, the scatter of the mutation. If it is lower, $\sigma$ should be
decreased and thus, the mutations are narrowed down.
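
As an illustration, the rule can be combined with the (1 + 1)-ES from Section 5.3.1. The sketch below makes several assumptions that are not stated in the text: a single step size for all elements, an adaptation every `window` generations, and the multiplicative `factor` of 0.85. All names are illustrative.

```python
import random


def adapt_sigma(sigma, successes, trials, factor=0.85):
    """Widen sigma if more than 1/5 of the mutations succeeded, narrow it if fewer did."""
    rate = successes / trials
    if rate > 0.2:
        return sigma / factor
    if rate < 0.2:
        return sigma * factor
    return sigma


def one_plus_one_es(f, x, sigma=1.0, generations=200, window=20, factor=0.85, seed=0):
    """Minimize f with a (1 + 1)-ES whose step size follows the 1/5 success rule."""
    rng = random.Random(seed)
    fx = f(x)
    successes = 0
    for g in range(1, generations + 1):
        child = [rng.gauss(xi, sigma) for xi in x]
        fc = f(child)
        if fc < fx:  # the offspring replaces the parent only if it is better
            x, fx = child, fc
            successes += 1
        if g % window == 0:  # re-evaluate the success quotient periodically
            sigma = adapt_sigma(sigma, successes, window, factor)
            successes = 0
    return x, fx, sigma
```

Calling `one_plus_one_es(lambda v: sum(c * c for c in v), [5.0, -3.0])` minimizes the sphere function; because the parent is only ever replaced by a better offspring, the returned objective value cannot exceed that of the start point.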

5.5 Differential Evolution 
5.5.1 Introduction 

Differential Evolution³ (DE, DES) is a method for the mathematical optimization of
multidimensional functions that belongs to the group of evolution strategies [1676, 653, 1404,
288, 1234, 1391, 189]. Developed by Storn and Price [1974], the DE technique was invented
in order to solve the Chebyshev polynomial fitting problem. It has proven to be a very
reliable optimization strategy for many different tasks whose parameters can be encoded
in real vectors.

The essential idea behind Differential Evolution is the way the (ternary) recombination
operator "deRecombination" is defined for creating new solution candidates. The difference
$x_1 - x_2$ of two vectors $x_1$ and $x_2$ in $X$ is weighted with a weight $w \in \mathbb{R}$
and added to a third vector $x_3$ in the population.

$$x = \text{deRecombination}(x_1, x_2, x_3) \Rightarrow x = x_3 + w\,(x_1 - x_2) \qquad (5.1)$$

³ http://en.wikipedia.org/wiki/Differential_evolution [accessed 2007-07-03],
http://www.icsi.berkeley.edu/~storn/code.html [accessed 2007-07-03]

Except for determining w, no additional probability distribution has to be used and the Dif- 
ferential Evolution scheme is completely self-organizing. This classical reproduction strategy 
has been complemented with new ideas like triangle mutation and alternations with weighted 
directed strategies. 
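
Equation 5.1 translates almost literally into code. In the following Python sketch, the way the three vectors are picked (uniformly and without repetition) and the default weight of 0.7 are assumptions made for illustration; only `de_recombination` itself mirrors the equation.

```python
import random


def de_recombination(x1, x2, x3, w):
    """Equation 5.1: add the weighted difference of x1 and x2 to x3."""
    return [c + w * (a - b) for a, b, c in zip(x1, x2, x3)]


def de_offspring(population, w=0.7, rng=random):
    """Create one offspring from three distinct, randomly chosen population members."""
    x1, x2, x3 = rng.sample(population, 3)
    return de_recombination(x1, x2, x3, w)
```

For example, `de_recombination([1.0, 2.0], [0.0, 4.0], [10.0, 10.0], 0.5)` yields `[10.5, 9.0]`.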

Gao and Wang [770] emphasize the close similarities between the reproduction operators 
of Differential Evolution and the search step of the downhill simplex. Thus, it is only logical 
to combine or to compare the two methods (see Section 16.4 on page 286). Further improve- 
ments to the basic Differential Evolution scheme have been contributed, for instance, by 
Kaelo and Ali. Their DERL and DELB algorithms outperformed [1078, 1079, 1077] stan- 
dard DE on the test benchmark from Ali et al. [38]. 



5.5.2 General Information 
Areas Of Application 



Some example areas of application of Differential Evolution are: 



Application References 

Engineering, Structural Optimization, and Design [1233, 1506] 

Chemistry, Chemical Engineering [2148, 1846, 2052, 399] 

Scheduling [1289] 

Function Optimization [1972] 

Electrical Engineering and Circuit Design [1971, 1973] 



Journals 



Some journals that deal (at least partially) with Differential Evolution are: 



Journal of Heuristics (see Section 1.6.3 on page 91) 



Books 



Some books about (or including significant information about) Differential Evolution are: 



Price, Storn, and Lampinen [1676]: Differential Evolution - A Practical Approach to Global 
Optimization 

Feoktistov [653]: Differential Evolution - In Search of Solutions

Corne, Dorigo, Glover, Dasgupta, Moscato, Poli, and Price [448]: New Ideas in Optimisation

6.1 Introduction

Different from the other major types of evolutionary algorithms introduced, there exists,
to the knowledge of the author, no clear specification or algorithmic variant for
evolutionary programming¹ (EP). There is, though, a semantic difference: while single
individuals of a species are the biological metaphor for solution candidates in other
evolutionary algorithms, in evolutionary programming, a solution candidate is thought of as a
species itself.² Hence, mutation and selection are the only operators used in EP and
recombination is usually not applied. The selection scheme utilized in evolutionary
programming is normally quite similar to the $(\mu + \lambda)$ method in Evolution Strategies.

Evolutionary programming was pioneered by Fogel [705] in his PhD thesis back in 1964. 
Fogel et al. [708] experimented with the evolution of finite state machines as predictors for 
data streams [623]. Evolutionary programming is also the research area of his son David 
Fogel [697, 699, 700] with whom he also published joint work [707, 1671]. 

Generally, it is hard to distinguish evolutionary programming from Genetic Programming,
genetic algorithms, and Evolution Strategy. Although there are semantic differences
(as already mentioned), the author thinks that many aspects of the evolutionary
programming approach have merged into these other research areas.

6.2 General Information 
6.2.1 Areas Of Application 



Some example areas of application of evolutionary programming are: 



Application                                             References

Machine Learning                                        [697]
Cellular Automata and Finite State Machines             [708]
Evolving Behaviors, e.g., for Agents or Game Players    [699, 700]
Machine Learning                                        [1671]
Chemistry, Chemical Engineering and Biochemistry        [779, 609, 778]
Electrical Engineering and Circuit Design               [1135, 1518]
Data Mining and Data Analysis                           [1802]
Robotics                                                [1136]



¹ http://en.wikipedia.org/wiki/Evolutionary_programming [accessed 2007-07-03]
² In this aspect it is very similar to the much newer Extremal Optimization approach which will
be discussed in Chapter 13.

6.2.2 Conferences, Workshops, etc.

Some conferences and workshops on evolutionary programming are:

EP: International Conference on Evolutionary Programming
now part of CEC, see Section 2.2.2 on page 105
History: 1998: San Diego, California, USA, see [1670]
         1997: Indianapolis, Indiana, USA, see [68]
         1996: San Diego, California, USA, see [709]
         1995: San Diego, California, USA, see [1380]
         1994: see [1849]
         1993: see [702]
         1992: see [701]

EUROGEN: Evolutionary Methods for Design Optimization and Control with Applications
to Industrial Problems
see Section 2.2.2 on page 106



6.2.3 Books 

Some books about (or including significant information about) evolutionary programming 
are: 

Fogel, Owens, and Walsh [708]: Artificial Intelligence through Simulated Evolution 
Fogel [706] : Intelligence Through Simulated Evolution: Forty Years of Evolutionary Program- 
ming 

Fogel [697]: System Identification through Simulated Evolution: A Machine Learning Ap- 
proach to Modeling 

Fogel [700]: Blondie24: Playing at the Edge of AI

Back [99]: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary
Programming, Genetic Algorithms

Learning Classifier Systems



7.1 Introduction 

In the late 1970s, Holland, the father of genetic algorithms, also invented the concept of 
classifier systems (CS) [948, 941, 946]. These systems are a special case of production systems 
[497, 498] and consist of four major parts: 

1. a set of interacting production rules, called classifiers,

2. a performance algorithm which directs the actions of the system in the environment,

3. a learning algorithm which keeps track of the success of each classifier and distributes
rewards, and

4. a genetic algorithm which modifies the set of classifiers so that variants of good classifiers
persist and new, potentially better ones are created in an efficient manner [947].

Over time, classifier systems have undergone some name changes. In 1986, reinforcement
learning was added to the approach and the name changed to Learning Classifier Systems¹
(LCS) [916, 1909]. Learning Classifier Systems are sometimes subsumed under a machine
learning paradigm called evolutionary reinforcement learning (ERL) [916] or Evolutionary
Algorithms for Reinforcement Learning (EARLs) [1460].



7.2 General Information 
7.2.1 Areas Of Application 



Some example areas of application of Learning Classifier Systems are: 

Application References 

Data Mining and Data Analysis [768, 92, 479, 444, 2178]

Grammar Induction [2073, 2074, 472] 

Medicine [951] 

Image Processing [1287, 1376] 

Sequence Prediction [1736] 



7.2.2 Conferences, Workshops, etc. 

Some conferences and workshops on Learning Classifier Systems are:

¹ http://en.wikipedia.org/wiki/Learning_classifier_system [accessed 2007-07-03]

IWLCS: International Workshop on Learning Classifier Systems
Nowadays often co-located with GECCO (see Section 2.2.2 on page 107). 
History: 2007: London, England, see [1946] 

2006: Seattle, WA, USA, see [1847] 

2005: Washington DC, USA, see [2157, 1181] 

2004: Seattle, Washington, USA, see [1848, 1181] 

2003: Chicago, IL, USA, see [2022, 1181] 

2002: Granada, Spain, see [1254] 

2001: San Francisco, CA, USA, see [1944] 

2000: Paris, France, see [1253] 

1999: Orlando, Florida, USA, see [1585] 

1992: Houston, Texas, USA, see [1501] 



7.2.3 Books 

Some books about (or including significant information about) Learning Classifier Systems 
are: 

Bull [301]: Applications Of Learning Classifier Systems 

Bull and Kovacs [303] : Foundations of Learning Classifier Systems 

Butz [314]: Anticipatory Learning Classifier Systems 

Butz [315]: Rule-Based Evolutionary Online Learning Systems: A Principled Approach to 
LCS Analysis and Design 

Lanzi, Stolzmann, and Wilson [1252]: Learning Classifier Systems, From Foundations to 
Applications 



7.3 The Basic Idea of Learning Classifier Systems 

Figure 7.1 illustrates the structure of a Michigan-style Learning Classifier System. A classifier
system is connected via detectors (b) and effectors (c) to its environment (a). The input
to the system (coming from the detectors) is encoded in the form of binary messages that are
written into a message list (d). To this list, simple if-then rules (e), the so-called
classifiers, are applied. The result of a classification is again encoded as a message and
written to the message list. These new messages may now trigger other rules or serve as signals
for the effectors [507]. The payoff of the performed actions is distributed by the credit
apportionment system (f) to the rules. Additionally, a rule discovery system (g) is responsible
for finding new rules and adding them to the classifier population [794].

Classifier systems are special instances of production systems, which were shown to be 
Turing-complete by Post [1672] and Minsky [1427, 1426]. Thus, Learning Classifier Systems 
are as powerful as any other Turing-equivalent programming language and can be pictured 
as something like computer programs where the rules play the role of the instructions and 
the messages are the memory. 

7.3.1 A Small Example 

In order to describe how rules and messages are structured in a basic classifier system, we
borrow a simple example from Heitkotter and Beasley [916]. Our explanation follows
the syntax described by Geyer-Schulz [794]. You should, however, be aware that there
are many different forms of classifier systems and take this as an example of how it could be
done rather than as the way it is to be done.

Figure 7.1: The structure of a Michigan-style Learning Classifier System according to
Geyer-Schulz [794]. Components shown: (a) Environment, (b) Detectors, (c) Effectors,
(d) Message List, (f) Apportionment of Credit System (e.g., Bucket Brigade), (g) Rule Discovery
System (e.g., Genetic Algorithm). Two regions are labeled: "Non-Learning Classifier System,
Production System" and "Learning Classifier System". Arrows are labeled "information",
"action", and "payoff".

So let us imagine that we want to find a classifier system that is able to control the
behavior of a frog. Our frog likes to eat nutritious flies. Therefore, it can detect small,
flying objects and eat them if they are right in front of it. The frog also has a sense of
direction and can distinguish between objects which are in front, to the left, or to the right
of it and may also turn into any of these directions. It can furthermore distinguish objects
with stripes from those without. Flying objects with stripes are most likely bees or wasps,
eating of which would probably result in being stung. The frog can also sense large, looming
objects far above: birds, which should be avoided by jumping away quickly. We can compile
a corresponding behavior into the form of simple if-then rules which are listed in Table 7.1.



No.  premise (if-part)                                  conclusion (then-part)

1    small, flying object with no stripes to the left   send a
2    small, flying object with no stripes to the right  send b
3    small, flying object with no stripes to the front  send c
4    large, looming object                              send d
5    a and not d                                        turn left
6    b and not d                                        turn right
7    c and not d                                        eat
8    d                                                  move away rapidly

Table 7.1: if-then rules for frogs

Bits   Field      Values                             Group

0      size       0=small, 1=large                   detector input
1      type       0=flying, 1=looming                detector input
2-3    direction  01=left, 10=center, 11=right       detector input
4      stripes    0=without, 1=with                  detector input
5-7    memory     001=a, 010=b, 011=c, 100=d         memory
8-9    turn       00=don't turn, 01=left, 10=right   actions
10     jump       0=no, 1=yes                        actions
11     eat        0=no, 1=yes                        actions

Figure 7.2: One possible encoding of messages for a frog classifier system



In Figure 7.2, we demonstrate how the messages in a classifier system that drives such a 
frog can be encoded. Here, input information as well as action commands (the conclusions 
of the rules) are compiled in one message type. Also, three bits are assigned for encoding 
the internal messages a to d. Two bits would not suffice, since 00 occurs in all "original" 
input messages. At the beginning of a classification process, the input messages are written 
to the message list. They contain information only at the positions reserved for detections 
and have zeros in the bits for memory or actions. The classifiers transform them to internal 
messages which normally have only the bits marked as "memory" set. These messages are 
finally transformed to output messages by setting some action bits. In our frog system, a
message is in total k = 12 bits long, i.e., $\text{len}(m) = 12 \;\forall$ messages $m$.
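
To make the layout concrete, the following Python sketch builds detector messages as strings of 12 characters. It assumes the field order read off Figure 7.2 (size, type, two direction bits, stripes, followed by seven zero bits for memory and actions); the function and parameter names are invented for this illustration.

```python
# Two-bit direction codes as given in Figure 7.2.
DIRECTION = {"left": "01", "center": "10", "right": "11"}


def detector_message(large: bool, looming: bool, direction: str, stripes: bool) -> str:
    """Encode one detected object as a 12-bit input message.

    Only the five detector bits carry information; the three memory bits
    and the four action bits are zero in input messages.
    """
    bits = ("1" if large else "0") + ("1" if looming else "0")
    bits += DIRECTION[direction]
    bits += "1" if stripes else "0"
    return bits + "0" * 7
```

A small, flying, unstriped object to the left is thus encoded as `000100000000`.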

7.3.3 Conditions 

Rules in classifier systems consist of a condition part and an action part. The conditions
have the same length k as the messages. Instead of being binary encoded strings, a ternary
system consisting of the symbols 0, 1, and * is used. In a condition,

1. 0 means that the corresponding bit in the message must be 0,

2. 1 means that the corresponding bit in the message must be 1, and

3. * means don't care, i.e., the corresponding bit in the message may be 0 as well as 1 for
the condition to match.

Definition 7.1 (match). A message m matches a condition c if match(m, c) evaluates
to true.

$$\text{match}(m, c) \Leftrightarrow \forall\, 0 \le i < |m| : \left(m[i] = c[i]\right) \lor \left(c[i] = *\right) \qquad (7.1)$$
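
Definition 7.1 can be expressed as a one-line test. The Python sketch below represents messages and conditions as strings over "0", "1" and "*"; this representation is an assumption of the example, not something prescribed by the text.

```python
def match(message: str, condition: str) -> bool:
    """Equation 7.1: every condition symbol is '*' or equals the message bit."""
    if len(message) != len(condition):
        return False  # conditions have the same length k as the messages
    return all(c == "*" or c == m for m, c in zip(message, condition))
```

For instance, the condition `00010*******` matches the message `000100000000`, but not `001110000000`.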

The conditional part of a rule may consist of multiple conditions which are implicitly
concatenated with logical and ($\land$). A classifier is satisfied if all its conditions are
satisfied by at least one message in the current message list. Each of the conditions
of a classifier may be matched by a different message.

We can precede each single condition c with an additional ternary digit which defines
whether it should be negated or not: * stands for the negation $\lnot c$, and 0 as well as 1
denotes c. Here we deviate from the syntax described in Geyer-Schulz [794] because the
definition of the "conditionSpecifity" (see Definition 7.2) becomes more beautiful this way. A
negated condition evaluates to true if no message exists that matches it. By combining and and
not, we get nands with which we can build all other logic operations and, hence, whole computers
[2045]. Algorithm 7.1 illustrates how the condition part C is matched against the message
list M. If the matching is successful, it returns the list S of messages that satisfied the
conditions. Otherwise, the output will be the empty list ().



Algorithm 7.1: S ← matchesConditions(M, C)

Input: M: the message list
Input: C: the condition part of a classifier
Input: [implicit] k: the length of the messages m ∈ M and the single conditions c ∈ C
Input: [implicit] havePrefix: true if and only if the single conditions have a prefix which
       determines whether or not they are negated, false if no such prefixes are used
Data: i: a counter variable
Data: c: a condition
Data: neg: should the condition be negated?
Data: m: a single message from M
Data: b: a Boolean variable
Output: S: the messages that match the condition part C, or () if none such message exists

begin
  S ← ()
  b ← true
  i ← 0
  while (i < len(C)) ∧ b do
    if havePrefix then
      neg ← (C[i] = *)
      i ← i + 1
    else neg ← false
    c ← subList(C, i, k)
    i ← i + k
    if ∃m ∈ M : match(m, c) then
      b ← ¬neg
      if b then S ← addListItem(S, m)
    else
      b ← neg
      if b then S ← addListItem(S, createList(k, 0))
  if b then return S
  else return ()
end
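
A possible Python rendering of Algorithm 7.1 is sketched below. It assumes that the condition part is one flat string in which every condition of length k is preceded by its prefix digit when `have_prefix` is true; the early `return []` replaces the Boolean variable b of the pseudocode. The helper `match` is repeated so that the fragment is self-contained.

```python
def match(message: str, condition: str) -> bool:
    """True if every condition symbol is '*' or equals the message bit."""
    return all(c == "*" or c == m for m, c in zip(message, condition))


def matches_conditions(messages, conditions, k, have_prefix=True):
    """Return the messages satisfying all conditions, or [] if one condition fails."""
    satisfied = []
    i = 0
    while i < len(conditions):
        neg = False
        if have_prefix:
            neg = conditions[i] == "*"  # '*' as prefix marks a negated condition
            i += 1
        c = conditions[i:i + k]
        i += k
        hit = next((m for m in messages if match(m, c)), None)
        if hit is not None:
            if neg:
                return []  # a negated condition was matched
            satisfied.append(hit)
        else:
            if not neg:
                return []  # a plain condition found no message
            satisfied.append("0" * k)  # placeholder message of k zeros
    return satisfied
```

With the condition part `"1" + "*****001****" + "*" + "*****100****"` (read as "a and not d" in the frog encoding), the message list `["000000010000"]` satisfies the classifier, while adding the message `000001000000` inhibits it.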



Definition 7.2 (Condition Specificity). The condition specificity conditionSpecifity(x)
of a classifier x is the number of non-* symbols in its condition part C(x).

$$\text{conditionSpecifity}(x) = \left|\{\, i : C(x)[i] \ne * \,\}\right| \qquad (7.2)$$

A classifier (rule) $x_1$ with a higher condition specificity is more specific than another
rule $x_2$ with a lower condition specificity. On the other hand, a rule $x_2$ with
conditionSpecifity($x_1$) > conditionSpecifity($x_2$) is more general than the rule $x_1$. We can
use this information if two rules match one message and only one should be allowed to
post a message. Preferring the more specific rule in such situations leads to default
hierarchies [949, 1737, 1739, 1908], which allow general classifications to "delegate" special
cases to specialized classifiers. Even more specialized classifiers can then represent
exceptions to these refined rules.
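
Counting the non-* symbols is straightforward. The sketch below also shows how the more specific of several matching rules could be preferred; representing a condition part as a plain string and the name `most_specific` are assumptions of this example.

```python
def condition_specificity(condition_part: str) -> int:
    """Equation 7.2: the number of symbols in the condition part that are not '*'."""
    return sum(1 for symbol in condition_part if symbol != "*")


def most_specific(condition_parts):
    """Pick the condition part with the highest specificity (first one wins on ties)."""
    return max(condition_parts, key=condition_specificity)
```

For example, `11**********` has specificity 2 and `00010*******` has specificity 5, so the latter would be preferred.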

7.3.4 Actions 

The action part of a rule normally has exactly the same length as a message. It can be
represented by a string of either binary or ternary symbols. In the first case, the action part
of a rule is simply copied to the message list if the classifier is satisfied. In the latter
case, some sort of merging needs to be performed. Here,

1. a 0 in the action part will lead to a 0 in the corresponding message bit,

2. a 1 in the action part will lead to a 1 in the corresponding message bit,

3. and for a * in the action part, we copy the corresponding bit from the (first) message
that matched the classifier's condition to the newly created message.

Definition 7.3 (mergeAction). The function "mergeAction" computes a new message n
as the product of an action a. If the alphabet the action is based on is ternary and may contain
*-symbols, mergeAction needs access to the message m which has satisfied the first condition of
the classifier to which a belongs. If the classifier contains negation symbols and the first
condition was negated, m is assumed to be a string of zeros (m = createList(len(a), 0)). Notice
that we do not explicitly distinguish between binary and ternary encoding in mergeAction,
since * cannot occur in actions based on a binary alphabet and Equation 7.3 stays valid.

$$\begin{aligned}
n = \text{mergeAction}(a, m) \Leftrightarrow{} & \left(\text{len}(n) = \text{len}(a)\right) \land {} \\
& \left(n[i] = a[i] \;\; \forall i \in 0..\text{len}(a) - 1 : a[i] \ne *\right) \land {} \\
& \left(n[i] = m[i] \;\; \forall i \in 0..\text{len}(a) - 1 : a[i] = *\right)
\end{aligned} \qquad (7.3)$$
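
Equation 7.3 amounts to a position-wise choice between the action symbol and the message bit. A minimal Python sketch, again assuming string representations:

```python
def merge_action(action: str, message: str) -> str:
    """Equation 7.3: copy action symbols, but take the message bit wherever the action has '*'."""
    if len(action) != len(message):
        raise ValueError("action and message must have the same length")
    return "".join(m if a == "*" else a for a, m in zip(action, message))
```

A purely binary action is copied unchanged, whereas an action such as `0000000100**` passes the last two bits of the matching message through.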

7.3.5 Classifiers 

So we know that a rule x consists of a condition part C(x) and an action part a(x). C
is a list of $r \in \mathbb{N}$ conditions $c_i$, and we distinguish between representations with
negation symbols ($C = (n_1, c_1, n_2, c_2, \ldots, n_r, c_r)$) and without
($C = (c_1, c_2, \ldots, c_r)$). Let us now go back to our frog example. Based on the encoding
scheme defined in Figure 7.2, we can translate Table 7.1 into a set of classifiers. We therefore
compose the condition parts of two conditions $c_1$ and $c_2$ with the negation symbols $n_1$
and $n_2$, i.e., r = 2. Table 7.2 contains the result of this encoding.

Table 7.2: The encoded form of the if-then rules for frogs from Table 7.1.

We can apply this classifier system to a situation in the life of our frog where it detects

1. a fly to its left,

2. a bee to its right, and

3. a stork left in the air.

How will it react? The input sensors will generate three messages and insert them into the
message list $M_1 = (m_1, m_2, m_3)$:

1. $m_1$ = (000100000000) for the fly,

2. $m_2$ = (001110000000) for the bee, and

3. $m_3$ = (110100000000) for the stork.

The first message triggers rule 1 and the third message triggers rule 4, whereas no condition
fits the second message. As a result, the new message list $M_2$ contains two messages, $m_4$
and $m_5$, produced by the corresponding actions.

1. $m_4$ = (000000010000) from rule 1 and

2. $m_5$ = (000001000000) from rule 4.

$m_4$ could trigger rule 5 but is inhibited by the negated second condition $c_2$ because of
message $m_5$. $m_5$ matches classifier 8, which finally produces message $m_6$ = (000000000010)
which forces the frog to jump away. No further classifiers become satisfied with the new message
list $M_3 = (m_6)$ and the classification process is terminated.

7.3.6 Non-Learning Classifier Systems 

So far, we have described a non-learning classifier system. Algorithm 7.2 defines the behavior
of such a system, which we could also observe in the example. It still lacks the credit
apportionment and the rule discovery systems (see (f) and (g) in Figure 7.1). A non-learning
classifier system is able to operate correctly on a fixed set of situations. It is sufficient
for all applications where we are able to determine this set beforehand and no further
adaptation is required. If this is the case, we can use genetic algorithms to evolve the
classifier systems offline, for instance.

Algorithm 7.2 illustrates how a classifier system works. No optimization or approximation 
of a solution is done; this is a complete control system in action. Therefore we do not need 
a termination criterion but run an infinite loop. 

7.3.7 Learning Classifier Systems 

In order to convert this non-learning classifier system to a Learning Classifier System as
proposed by Holland [943] and sketched in Algorithm 7.3, we have to add the aforementioned
missing components. Heitkotter and Beasley [916] suggest two ways for doing so:

1. Currently, the activation of a classifier x results solely from the message-matching
process. If a message matches the condition(s) C(x), the classifier may perform its action
a(x). We can change this mechanism by making it also dependent on an additional
parameter v(x), a strength value, which can be modified as a result of experience, i.e., by
reinforcement from the environment. Therefore, we have to solve the credit assignment
problem first defined by Minsky [1425, 1428], since chains of multiple classifiers can cause
a certain action.

2. Furthermore (or instead), we may also modify the set of classifiers P by adding,
removing, or combining condition/action parts of existing classifiers.

A Learning Classifier System hence is a control system which is able to learn while
actually running and performing its work. Usually, a training phase will precede any actual
deployment. Afterwards, the learning rate may be decreased, or the learning may even be
deactivated, which turns the LCS into an ordinary classifier system.

Algorithm 7.2: nonLearningClassifierSystem(P)

Input: P: the list of rules x_i that determine the behavior of the classifier system
Input: [implicit] readDetectors: a function which creates a new message list containing only the
       input messages from the detectors
Input: [implicit] sendEffectors: a function which translates all messages concerning effectors to
       signals for the output interface
Input: [implicit] t_max ∈ ℕ: the maximum number of iterations for the internal loop, avoids
       endless loops
Data: t: a counter for the internal loop
Data: M, N, S: the message lists
Data: x: a single classifier

begin
  while true do
    M ← readDetectors()
    t ← 0
    repeat
      N ← ()
      foreach x ∈ P do
        S ← matchesConditions(M, C(x))
        if len(S) > 0 then
          N ← addListItem(N, mergeAction(a(x), S[0]))
      M ← N
      t ← t + 1
    until (len(M) = 0) ∨ (t > t_max)
    if len(M) > 0 then sendEffectors(M)
end
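
The inner loop of such a system can be tried out on the frog example. The Python sketch below is a simplification and not a transcription of Algorithm 7.2: it processes one set of detector messages, stops as soon as no classifier fires, and returns every message list it went through instead of calling sendEffectors. Conditions are written as (negated, pattern) pairs rather than with prefix digits, and the rule strings in `FROG_RULES` are one possible encoding chosen for this illustration following the field layout of Figure 7.2; they are not taken from Table 7.2.

```python
K = 12  # message length in the frog example


def match(message, condition):
    """True if every condition symbol is '*' or equals the message bit."""
    return all(c == "*" or c == m for m, c in zip(message, condition))


def matches_conditions(messages, conditions):
    """conditions is a list of (negated, pattern) pairs; returns [] if unsatisfied."""
    satisfied = []
    for negated, pattern in conditions:
        hit = next((m for m in messages if match(m, pattern)), None)
        if (hit is None) != negated:
            return []  # plain condition unmatched, or negated condition matched
        satisfied.append("0" * K if negated else hit)
    return satisfied


def merge_action(action, message):
    """Copy the action, taking the message bit wherever the action has '*'."""
    return "".join(m if a == "*" else a for a, m in zip(action, message))


def classify(rules, detector_messages, t_max=10):
    """Iterate the message list until no classifier fires; return all message lists."""
    messages = list(detector_messages)
    history = [messages]
    for _ in range(t_max):
        new_messages = []
        for conditions, action in rules:
            satisfied = matches_conditions(messages, conditions)
            if satisfied:
                new_messages.append(merge_action(action, satisfied[0]))
        if not new_messages:
            break
        messages = new_messages
        history.append(messages)
    return history


NOT_D = (True, "*****100****")  # negated condition "not d"
FROG_RULES = [  # hypothetical encoding of the eight rules of Table 7.1
    ([(False, "00010*******")], "000000010000"),         # 1: fly left -> send a
    ([(False, "00110*******")], "000000100000"),         # 2: fly right -> send b
    ([(False, "00100*******")], "000000110000"),         # 3: fly in front -> send c
    ([(False, "11**********")], "000001000000"),         # 4: large, looming -> send d
    ([(False, "*****001****"), NOT_D], "000000000100"),  # 5: a and not d -> turn left
    ([(False, "*****010****"), NOT_D], "000000001000"),  # 6: b and not d -> turn right
    ([(False, "*****011****"), NOT_D], "000000000001"),  # 7: c and not d -> eat
    ([(False, "*****100****")], "000000000010"),         # 8: d -> move away rapidly
]
```

Calling `classify(FROG_RULES, ["000100000000", "001110000000", "110100000000"])` reproduces the three message lists of the example: the detector messages, then the internal messages a and d, and finally the single output message that makes the frog jump away.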



7.3.8 The Bucket Brigade Algorithm 

The Bucket Brigade Algorithm has been developed by Holland [942, 943] as one method
of solving the credit assignment problem in Learning Classifier Systems. Research work
concerning this approach and its possible extensions has been conducted by Westerdale
[2195, 2196, 2197], Antonisse [74], Huang [969], Riolo [1738, 1737], Dorigo [579], Spiessens
[1942], Wilson [2234], Holland and Burks [946], and Hewahi and Bharadwaj [922] and has
neatly been summarized by Hewahi [920, 921]. In the following, we will outline this approach
with the notation of de Boer [507].

The Bucket Brigade Algorithm selects the classifiers from the match set X that are
allowed to post a message (i.e., to become members of the activated set U) by an auction.
To this end, each matching classifier x places a bid B(x), which is the product of a linear
function $\vartheta$ of the condition specificity of x, a constant $0 < \beta < 1$ that determines
which fraction of the strength of x should be used, and its strength v(x) itself. In practical
applications, values like 1/8 or 1/16 are often chosen for $\beta$.

$$B(x) = \vartheta(x) \cdot \beta \cdot v(x) + \text{random}_n(0, \sigma^2) \qquad (7.4)$$

Sometimes, a normally distributed random number is added to each bid in order to make the
decisions of the system less deterministic, as done in Equation 7.4.

The condition specificity is included in the bid calculation because it gives a higher value
to rules with fewer *-symbols in their conditions. These rules match fewer messages and
can be considered more relevant in the cases they do match. For $\vartheta$, the quotient of the
number of non-*-symbols and the condition length, plus some constant $0 < \alpha$ determining the
importance of the specificity of the classifier, is often used [507].

$$\vartheta(x) = \frac{\text{conditionSpecifity}(x)}{\text{len}(C(x))} + \alpha \qquad (7.5)$$
Algorithm 7.3: learningClassifierSystem()

Input: P: the list of rules x_i that determine the behavior of the classifier system
Input: [implicit] generateClassifiers: a function which randomly creates a population P of
       classifiers
Input: [implicit] readDetectors: a function which creates a new message list containing only the
       input messages from the detectors
Input: [implicit] sendEffectors: a function which translates all messages concerning effectors to
       signals for the output interface
Input: [implicit] selectMatchingClassifiers: a function that determines at most k classifiers from
       the matching set that are allowed to trigger their actions
Input: [implicit] generationCriterion: a criterion that becomes true if new classifiers should be
       created
Input: [implicit] updateRules: a function that finds new rules and deletes old ones
Input: [implicit] t_max ∈ ℕ: the maximum number of iterations for the internal loop, avoids
       endless loops
Data: t, i: counter variables
Data: M, N, S: the message lists
Data: X: a list of tuples containing classifiers and the (first) messages that satisfied their
      conditions
Data: v: the strength values
Data: x: a single classifier

begin
  P ← generateClassifiers(s)
  foreach x ∈ P do v(x) ← 1
  while true do
    M ← readDetectors()
    t ← 0
    repeat
      foreach x ∈ P do
        S ← matchesConditions(M, C(x))
        if len(S) > 0 then
          X ← addListItem(X, (x, S[0]))
      N ← ()
      if len(X) > 0 then
        (X, v) ← selectMatchingClassifiers(X, v)
        for i ← 0 up to len(X) − 1 do
          x ← X[i,0]
          N ← addListItem(N, mergeAction(a(x), X[i,1]))
      M ← N
      t ← t + 1
    until (len(M) = 0) ∨ (t > t_max)
    if len(M) > 0 then
      sendEffectors(M)
      // distribute payoffs
    if generationCriterion() then P ← updateRules(P, v)
end

The bucket brigade version of the selectMatchingClassifiers function introduced in
Algorithm 7.3 then picks the k classifiers with the highest bids and allows them to write
their messages into the new message list. They are charged with the payment part P(x) of
their bids. The payment contains neither the condition-specificity-dependent part nor
the possible random addend. It is added as reward R(y) to the strength of the classifier y
that wrote the message which allowed them to become active. If this was an
input message, the payment is simply discarded. The payment of classifiers that are not
activated is zero.

\[ P(x) = \beta \cdot v(x) \tag{7.6} \]

In some Learning Classifier Systems, a life-tax T(x) is collected from all classifiers in
each cycle. It is computed as a small fraction τ of their strength.

\[ T(x) = \tau \cdot v(x) \tag{7.7} \]

Those classifiers that successfully triggered an action of the effectors receive a reward
R(x) from the environment which is added to their strength. Together with the payment
method, all rules that are involved in a successful action receive some of the reward, which
is handed down stepwise, similar to how water is transported by a bucket brigade. For all
classifiers that neither produce output to the effectors nor receive payment from
other classifiers they have triggered, this reward is zero.

In total, the new strength v_{t+1}(x) of a classifier x is composed of its old strength, its
payment P(x), the life-tax T(x), and the reward R(x).

\[ v_{t+1}(x) = v_t(x) - P(x) - T(x) + R(x) \tag{7.8} \]
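
The bid and strength bookkeeping of Equations 7.4 to 7.8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the string encoding of conditions (with `*` as the don't-care symbol), and the default values for `alpha`, `beta`, `tau`, and `sigma` are hypothetical and not taken from the text.

```python
import random


def specificity(condition, alpha=0.1):
    """Eq. 7.5: share of non-'*' symbols in the condition plus a constant alpha > 0.
    The string encoding of conditions is an assumption made for this sketch."""
    return sum(1 for c in condition if c != "*") / len(condition) + alpha


def bid(condition, strength, beta=0.125, sigma=0.0, rng=random):
    """Eq. 7.4: specificity * beta * strength, plus optional Gaussian noise."""
    noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return specificity(condition) * beta * strength + noise


def update_strength(strength, activated, reward=0.0, beta=0.125, tau=0.01):
    """Eq. 7.8: v_{t+1} = v_t - P - T + R."""
    payment = beta * strength if activated else 0.0  # Eq. 7.6, zero if not activated
    tax = tau * strength                             # Eq. 7.7, life-tax
    return strength - payment - tax + reward
```

For example, an activated classifier with strength 100 that receives a reward of 5 ends the cycle with 100 - 12.5 - 1 + 5 = 91.5 under these default values.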

Instead of the Bucket Brigade Algorithm, it is also possible to use Q-Learning in Learning 
Classifier Systems, as shown by Wilson [2235]. Dorigo and Bersini [580] have shown that 
both concepts are roughly equivalent [916]. 

7.3.9 Applying the Genetic Algorithm 

With the credit assignment alone, no new rules can be discovered: only the initial, randomly
created rule set P is rated. At certain points in time, a genetic algorithm (see Chapter 3
on page 141) therefore replaces old rules by new ones. In Learning Classifier Systems we apply
steady-state genetic algorithms, which are discussed in Section 2.1.6 on page 102. They retain
most of the classifier population and only replace the weakest rules. To this end, the strength
v(x) of a rule x is directly used as its fitness and is subject to maximization.

For mutation and crossover, the well known reproduction operations for fixed-length 
string chromosomes discussed in Section 3.4 on page 147 are employed. 

7.4 Families of Learning Classifier Systems 

The exact definition of Learning Classifier Systems [1180, 950, 1909, 1251] still seems
contentious and there exist many different implementations. There are, for example, versions
without message list where the action part of the rules does not encode messages but direct
output signals. The importance of the role of genetic algorithms in conjunction with the
reinforcement learning component is also not quite clear. There are scientists who emphasize
more the role of the learning components [2239] and others who tend to grant the genetic
algorithms a higher weight [466, 948]. The families of Learning Classifier Systems have been
listed and discussed elaborately by Brownlee [296]. Here we will just summarize their
differences in short. De Jong [514, 513] and Grefenstette [852] divide Learning Classifier Systems
into two main types, depending on how the genetic algorithm acts: The Pitt approach
originated at the University of Pittsburgh with the LS-1 system developed by Smith [1912].
It was then developed further and applied by Spears and De Jong [1926], De Jong and
Spears [516], De Jong et al. [517], Bacardit i Peharroya [92, 93], and Bacardit i Peharroya
and Krasnogor [94]. Pittsburgh-style Learning Classifier Systems work on a population of
separate classifier systems, which are combined and reproduced by the genetic algorithm.

The original idea of Holland and Reitman [948] was the Michigan-style LCS, where the
whole population itself is considered as one classifier system. Such systems focus on selecting
the best rules in this rule set [820, 507, 1297].

Wilson [2235, 2236] developed two subtypes of Michigan-style LCS:

1. In ZCS systems, there is no message list. They use fitness sharing [2235, 418, 302, 300] for a
Q-learning-like reinforcement learning approach called QBB.

2. ZCS have later been somewhat superseded by XCS systems in which the Bucket Brigade
Algorithm has fully been replaced by Q-learning. Furthermore, the credit assignment is
based on the accuracy (usefulness) of the classifiers. The genetic algorithm is applied
to sub-populations containing only classifiers which apply to the same situations [2236,
1179, 2237, 2238, 1256].
Ant Colony Optimization



8.1 Introduction 

Inspired by the research done by Deneubourg et al. [554], [553, 839] on real ants and probably
by the simulation experiments by Stickland et al. [1964], Dorigo et al. [584] developed the
Ant Colony Optimization¹ (ACO) algorithm in 1996 for problems that can be reduced to finding
optimal paths in graphs [581, 585, 1352, 1355, 593]. Ant Colony Optimization is based
on the metaphor of ants seeking food. In order to do so, an ant will leave the anthill and
begin to wander into a random direction. While the little insect paces around, it lays a trail
of pheromone. Thus, after the ant has found some food, it can track its way back. By doing
so, it distributes another layer of pheromone on the path. An ant that senses the pheromone
will follow its trail with a certain probability. Each ant that finds the food will excrete some
pheromone on the path. Over time, the pheromone density of the path will increase and more
and more ants will follow it to the food and back. The higher the pheromone density, the
more likely an ant will stay on a trail. However, the pheromones vaporize after some time.
If all the food is collected, they will no longer be renewed and the path will disappear after
a while. Now, the ants will head to new, random locations.

This process of distributing and tracking pheromones is one form of stigmergy² and was
first described by Grassé [849]. Today, we subsume many different ways of communication
by modifying the environment under this term, which can be divided into two groups: 
sematectonic and sign-based [1833]. According to Wilson [2231], we call modifications in the 
environment due to a task-related action which leads other entities involved in this task to 
change their behavior sematectonic stigmergy. If an ant drops a ball of mud somewhere, this 
may cause other ants to place mud balls at the same location. Step by step, these effects can 
cumulatively lead to the growth of complex structures. Sematectonic stigmergy has been 
simulated on computer systems by, for instance, Theraulaz and Bonabeau [2032] and with 
robotic systems by Werfel and Nagpal [2192, 1837, 2193]. 

The second form, sign-based stigmergy, is not directly task-related. It has been acquired
through evolution by social insects, which use a wide range of pheromones and hormones for
communication. Computer simulations for sign-based stigmergy were first performed by
Stickland et al. [1964] in 1992.

The sign-based stigmergy is copied by Ant Colony Optimization [584], where optimization
problems are visualized as (directed) graphs. First, a set of ants performs randomized
walks through the graphs. Proportional to the goodness of the solutions denoted by
the paths, pheromones are laid out, i.e., the probability to walk into the direction of the
paths is shifted. The ants run again through the graph, following the previously distributed
pheromone. However, they will not exactly follow these paths. Instead, they may deviate
from these routes by taking other turns at junctions, since their walk is still randomized.
The pheromones modify the probability distributions.

¹ http://en.wikipedia.org/wiki/Ant_colony_optimization [accessed 2007-07-03]
² http://en.wikipedia.org/wiki/Stigmergy [accessed 2007-07-03]

It is interesting to note that even real vector optimizations can be mapped to a graph 
problem, as introduced by Korosec and Silc [1176]. Thanks to such ideas, the applicability 
of Ant Colony Optimization is greatly increased. 
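
The pheromone-guided random walk described in this introduction can be sketched as follows. This is a minimal illustration, not the ACO algorithm of [584]: the pheromone-proportional choice rule, the evaporation rate `rho`, the deposit of `1 / cost(path)`, and all function names are assumptions made for this example. The `cost` function is assumed to return a strictly positive value.

```python
import random


def choose_next(node, graph, pheromone, rng):
    """Pick a successor of `node` with probability proportional to edge pheromone."""
    successors = graph[node]
    weights = [pheromone[(node, nxt)] for nxt in successors]
    return rng.choices(successors, weights=weights, k=1)[0]


def walk(graph, pheromone, start, goal, rng, max_steps=100):
    """One randomized walk; returns the path, or None if the goal was not reached."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        if not graph[path[-1]]:
            return None  # dead end without outgoing edges
        path.append(choose_next(path[-1], graph, pheromone, rng))
    return path if path[-1] == goal else None


def update(pheromone, paths, cost, rho=0.1):
    """Evaporate all trails, then deposit pheromone proportional to path quality."""
    for edge in pheromone:
        pheromone[edge] *= (1.0 - rho)
    for path in paths:
        amount = 1.0 / cost(path)  # cheaper paths receive more pheromone
        for edge in zip(path, path[1:]):
            pheromone[edge] += amount
```

Repeatedly calling `walk` for a set of ants and then `update` with the paths they found shifts the choice probabilities towards edges on good paths, while evaporation lets unused trails fade.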



8.2 General Information 




8.2.1 Areas Of Application 




Some example areas of application of Ant Colony Optimization are: 


Application                      References

Combinatorial Optimization       [763, 582, 869, 765, 764, 304, 305, 577]
Scheduling                       [1392]
Networking and Communication     [1833, 1832, 2033, 1880, 725, 1281, 559, 561, 245],
                                 see Section 23.2 on page 401
Combinatorial Optimization       [1509]



8.2.2 Conferences, Workshops, etc. 

Some conferences and workshops on Ant Colony Optimization are:

ANTS: International Conference on Ant Colony Optimization and Swarm Intelligence
  http://iridia.ulb.ac.be/~ants/ [accessed 2008-08-20]
  History: 2008: Brussels, Belgium, see [1947]
           2006: Brussels, Belgium, see [594]
           2004: Brussels, Belgium, see [592]
           2002: Brussels, Belgium, see [590]
           2000: Brussels, Belgium, see [589]
           1998: Brussels, Belgium, see [588]
BIOMA: International Conference on Bioinspired Optimization Methods and their Applications
  see Section 2.2.2 on page 105
CEC: Congress on Evolutionary Computation
  see Section 2.2.2 on page 105
GECCO: Genetic and Evolutionary Computation Conference
  see Section 2.2.2 on page 107
ICNC: International Conference on Advances in Natural Computation
  see Section 1.6.2 on page 89



8.2.3 Journals 

Some journals that deal (at least partially) with Ant Colony Optimization are: 






Adaptive Behavior, ISSN: Online: 1741-2633, Print: 1059-7123, appears quarterly, editor(s):
Peter M. Todd, publisher: Sage Publications, http://www.isab.org/journal/ [accessed 2007-09-16],
http://adb.sagepub.com/ [accessed 2007-09-16]

Artificial Life, ISSN: 1064-5462, appears quarterly, editor(s): Mark A. Bedau, publisher: MIT
Press, http://www.mitpressjournals.org/loi/artl [accessed 2007-09-16]

IEEE Transactions on Evolutionary Computation (see Section 2.2.3 on page 108) 

The Journal of the Operational Research Society (see Section 1.6.3 on page 91) 



8.2.4 Online Resources 

Some general resources on Ant Colony Optimization that are available online are:

http://iridia.ulb.ac.be/~mdorigo/ACO/ [accessed 2007-09-13]
  Last update: up-to-date
  Description: Repository of books, publications, people, jobs, and software about ACO.
http://uk.geocities.com/markcsinclair/aco.html [accessed 2007-09-13]
  Last update: 2006-11-17
  Description: Small intro to ACO, some references, and a nice applet demonstrating its
               application to the travelling salesman problem [1263, 78].



8.2.5 Books 

Some books about (or including significant information about) Ant Colony Optimization 
are: 

Chan and Tiwari [372]: Swarm Intelligence - Focus on Ant and Particle Swarm Optimization
Dorigo and Stützle [583]: Ant Colony Optimization
Engelbrecht [633]: Fundamentals of Computational Swarm Intelligence
Nedjah and de Macedo Mourelle [1509]: Systems Engineering using Particle Swarm Optimisation
Bonabeau, Dorigo, and Theraulaz [245]: Swarm Intelligence: From Natural to Artificial Systems



8.3 River Formation Dynamics 



River Formation Dynamics (RFD) is a heuristic optimization method recently developed
by Rabanal Basalo et al. [1689, 1690]. It is inspired by the way water forms rivers by
eroding the ground and depositing sediments. In its structure, it is very close to Ant Colony
Optimization. In Ant Colony Optimization, paths through a graph are searched by attaching
attributes (the pheromones) to its edges. The pheromones are laid out by ants (+) and
vaporize as time goes by (-). In River Formation Dynamics, the heights above sea level are
the attributes of the vertices of the graph. On this landscape, rain begins to fall. Forced
by gravity, the drops flow downhill and try to reach the sea. The altitudes of the points
in the graph are decreased by erosion (-) when water flows over them and increased by
sedimentation (+) if drops end up in a dead end, vaporize, and leave behind the material which
they have eroded somewhere else. Sedimentation punishes inefficient paths: drops
reaching a node surrounded only by nodes of higher altitude will increase its height more
and more until it reaches the level of its neighbors and is not a dead end anymore. While
flowing over the map, the probability that a drop takes a certain edge depends on the gradient
of the down slope. This gradient, in turn, depends on the difference in altitude of the nodes
it connects and their distance (i.e., the cost function). Initially, all nodes have the same
altitude except for the destination node, which is a hole. New drops are inserted in the origin
node and flow over the landscape, reinforce promising paths, and either reach the destination
or vaporize in dead ends.
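
How the gradient could steer a drop can be illustrated with a small sketch. The text only states that the edge probability depends on the gradient. Choosing edges with probability directly proportional to the gradient, as done here, is an assumption, and so are the function name and the dictionary-based data layout.

```python
def downhill_probabilities(node, altitude, distance):
    """Return {neighbor: probability} for all neighbors lower than `node`.

    altitude maps node -> height, distance maps (from, to) -> edge cost.
    The gradient of an edge is the altitude difference divided by its distance.
    """
    gradients = {}
    for (a, b), d in distance.items():
        if a == node and altitude[a] > altitude[b]:
            gradients[b] = (altitude[a] - altitude[b]) / d
    total = sum(gradients.values())
    # An empty result marks a dead end: no neighbor lies lower than `node`.
    return {n: g / total for n, g in gradients.items()}
```

For a node at altitude 10 with a neighbor at altitude 4 (distance 2) and one at altitude 8 (distance 1), the gradients are 3 and 2, so the drop would take the first edge with probability 0.6 and the second with probability 0.4.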

Different from ACO, cycles cannot occur in RFD because the water always flows downhill.
Of course, rivers in nature may fork and reunite, too. But, unlike ACO, River Formation
Dynamics implicitly creates direction information in its resulting graphs. If this information
is considered to be part of the solution, then cycles are impossible. If it is stripped away,
cycles may occur.
Particle Swarm Optimization



9.1 Introduction 

Particle Swarm Optimization¹ (PSO), developed by Eberhart and Kennedy [615, 1124] in
1995, is a form of swarm intelligence in which the behavior of a biological social system like 
a flock of birds or a school of fish [1616] is simulated. When a swarm looks for food, its 
individuals will spread in the environment and move around independently. Each individual 
has a degree of freedom or randomness in its movements which enables it to find food 
accumulations. So, sooner or later, one of them will find something digestible and, being 
social, announces this to its neighbors. These can then approach the source of food, too. 
Particle Swarm Optimization has been discussed, improved, and refined by many researchers 
such as Venter and Sobieszczanski-Sobieski [2113], Cai et al. [324], Gao and Duan [771], and 
Gao and Ren [772] . Comparisons with other evolutionary approaches have been provided by 
Eberhart and Shi [616] and Angeline [64]. 

With Particle Swarm Optimization, a swarm of particles (individuals) in an n-dimensional
search space G is simulated, where each particle p has a position p.g ∈ G ⊆ ℝⁿ and a velocity
p.v ∈ ℝⁿ. The position p.g corresponds to the genotype and, in most cases, also to the
solution candidate, i.e., p.x = p.g, since most often the problem space X is also ℝⁿ
and X = G. However, this is not necessarily the case and generally, we can introduce any
form of genotype-phenotype mapping in Particle Swarm Optimization. The velocity vector
p.v of an individual p determines in which direction the search will continue and whether it
has an explorative (high velocity) or an exploitive (low velocity) character.

In the initialization phase of Particle Swarm Optimization, the positions and velocities
of all individuals are randomly initialized. In each step, first the velocity of a particle is
updated and then its position. For this purpose, each particle p has a memory holding its best
position best(p) ∈ G. In order to realize the social component, the particle furthermore
knows a set of topological neighbors N(p). This set could be defined to contain adjacent
particles within a specific perimeter, i.e., all individuals which are no further away from p.g
than a given distance δ according to a certain distance measure² dist. Using the Euclidean
distance measure dist_eucl specified in Definition 29.8 on page 538, we get:

\[ \forall p, q \in Pop: q \in N(p) \Leftrightarrow \mathrm{dist}_{eucl}(p.g, q.g) \leq \delta \tag{9.1} \]

Each particle can communicate with its neighbors, so the best position found so far by any
element in N(p) is known to all of them as best(N(p)). The best position ever visited by
any individual in the population (which the optimization algorithm always keeps track of)
is best(Pop).

The PSO algorithm may make use of either best(N(p)) or best(Pop) for adjusting the
velocity of the particle p. If it relies on the global best position, the algorithm will converge
fast but is less likely to find the global optimum. If, on the other hand, neighborhood
communication is used, the convergence speed drops but the global optimum is more likely
to be found.

¹ http://en.wikipedia.org/wiki/Particle_swarm_optimization [accessed 2007-07-03]
² See Section 29.1 on page 537 for more information on distance measures.

Definition 9.1 (psoUpdate). The search operation q = psoUpdate(p, Pop) applied in
Particle Swarm Optimization creates a new particle q to replace an existing one (p) by
incorporating its genotype p.g and its velocity p.v. We distinguish local updating (Equation 9.3)
and global updating (Equation 9.2), which additionally uses the data from the whole population
Pop. psoUpdate thus fulfills one of these two equations and Equation 9.4, showing how the
i-th components of the corresponding vectors are computed.

\[ q.v_i = p.v_i + (\mathrm{random}_u(0, c_i) \cdot (\mathrm{best}(p).g_i - p.g_i)) + (\mathrm{random}_u(0, d_i) \cdot (\mathrm{best}(Pop).g_i - p.g_i)) \tag{9.2} \]

\[ q.v_i = p.v_i + (\mathrm{random}_u(0, c_i) \cdot (\mathrm{best}(p).g_i - p.g_i)) + (\mathrm{random}_u(0, d_i) \cdot (\mathrm{best}(N(p)).g_i - p.g_i)) \tag{9.3} \]

\[ q.g_i = p.g_i + p.v_i \tag{9.4} \]

The learning rate vectors c and d have a strong influence on the convergence speed of
Particle Swarm Optimization. The search space G (and thus, also the values of p.g) is normally
confined by minimum and maximum boundaries. For the absolute values of the velocity,
maximum thresholds normally exist as well. Thus, real implementations of psoUpdate have
to check and refine their results before the utility of the solution candidates is evaluated.

Algorithm 9.1 illustrates the native form of the Particle Swarm Optimization using the 
update procedure from Definition 9.1. Like hill climbing, this algorithm can easily be gener- 
alized for multi-objective optimization and for returning sets of optimal solutions (compare 
with Section 10.3 on page 254). 



Algorithm 9.1: x* ← psoOptimizer(f, ps)

Input: f: the function to optimize
Input: ps: the population size
Data: Pop: the particle population
Data: i: a counter variable
Output: x*: the best value found

begin
  Pop ← createPop(ps)
  while ¬terminationCriterion() do
    for i ← 0 up to len(Pop) − 1 do
      Pop[i] ← psoUpdate(Pop[i], Pop)
  return best(Pop).x
end
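
A compact Python sketch of Algorithm 9.1 with global updating (Equation 9.2) may help to make the update concrete. Several simplifications are assumptions of this sketch and not part of the definition: f is minimized, the learning rates `c` and `d` are scalars instead of vectors, the termination criterion is a fixed number of steps, and the position step uses the freshly updated velocity (the common convention, following the "first velocity, then position" order of the text, whereas Equation 9.4 as printed refers to p.v). Velocities and positions are clamped as discussed after Definition 9.1.

```python
import random


def pso(f, dim, ps=20, steps=200, lo=-5.0, hi=5.0, c=1.5, d=1.5, vmax=1.0, seed=0):
    """Minimize f over [lo, hi]^dim with a global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(ps)]
    vel = [[rng.uniform(-vmax, vmax) for _ in range(dim)] for _ in range(ps)]
    best = [p[:] for p in pos]      # best(p): memory of every particle
    gbest = min(best, key=f)[:]     # best(Pop): best position seen by the swarm
    for _ in range(steps):
        for k in range(ps):
            for i in range(dim):
                v = (vel[k][i]
                     + rng.uniform(0, c) * (best[k][i] - pos[k][i])
                     + rng.uniform(0, d) * (gbest[i] - pos[k][i]))   # Eq. 9.2
                vel[k][i] = max(-vmax, min(vmax, v))                 # velocity threshold
                pos[k][i] = max(lo, min(hi, pos[k][i] + vel[k][i]))  # position step, kept in G
            if f(pos[k]) < f(best[k]):
                best[k] = pos[k][:]
                if f(best[k]) < f(gbest):
                    gbest = best[k][:]
    return gbest
```

A call such as `pso(lambda x: sum(xi * xi for xi in x), dim=2)` returns the best position found for the sphere function. Replacing `gbest` by the best position among the neighbors of particle `k` would give the local variant of Equation 9.3.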



9.2 General Information 
9.2.1 Areas Of Application 

Some example areas of application of particle swarm optimization are: 
Application                                   References

Machine Learning                              [1124, 1386, 1708]
Function Optimization                         [1124, 1617]
Geometry and Physics                          [2263]
Operations Research                           [125]
Chemistry, Chemical Engineering               [356, 1864]
Electrical Engineering and Circuit Design     [1509]



9.2.2 Online Resources 

Some general resources on particle swarm optimization that are available online are:

http://www.swarmintelligence.org/ [accessed 2007-08-26]
  Last update: up-to-date
  Description: Particle Swarm Optimization Website by Xiaohui Hu
http://www.red3d.com/cwr/boids/ [accessed 2007-08-20]
  Last update: up-to-date
  Description: Boids - Background and Update by Craig Reynolds
http://www.projectcomputing.com/resources/psovis/ [accessed 2007-08-20]
  Last update: 2004
  Description: Particle Swarm Optimization (PSO) Visualisation (or "PSO Visualization")
http://www.engr.iupui.edu/~eberhart/ [accessed 2007-08-20]
  Last update: 2003
  Description: Russ Eberhart's Home Page
http://www.cis.syr.edu/~mohan/pso/ [accessed 2007-08-20]
  Last update: 1999
  Description: Particle Swarm Optimization Homepage
http://tracer.uc3m.es/tws/pso/ [accessed 2007-11-00]
  Last update: up-to-date
  Description: Website on Particle Swarm Optimization



9.2.3 Conferences, Workshops, etc. 

Some conferences and workshops on particle swarm optimization are:

GECCO: Genetic and Evolutionary Computation Conference
  see Section 2.2.2 on page 107
ICNC: International Conference on Advances in Natural Computation
  see Section 1.6.2 on page 89
SIS: IEEE Swarm Intelligence Symposium
  http://www.computelligence.org/sis/ [accessed 2007-08-26]
  History: 2007: Honolulu, Hawaii, USA, see [1867]
           2006: Indianapolis, IN, USA, see [1022]
           2005: Pasadena, CA, USA, see [1021]
           2003: Indianapolis, IN, USA, see [1020]



9.2.4 Books 

Some books about (or including significant information about) particle swarm optimization 
are: 
Nedjah and de Macedo Mourelle [1508]: Swarm Intelligent Systems
Chan and Tiwari [372]: Swarm Intelligence - Focus on Ant and Particle Swarm Optimization
Clerc [415]: Particle Swarm Optimization
Bui and Alam [299]: Multi-Objective Optimization in Computational Intelligence: Theory
and Practice
Engelbrecht [633]: Fundamentals of Computational Swarm Intelligence
Kennedy, Eberhart, and Shi [1125]: Swarm Intelligence: Collective, Adaptive
Nedjah and de Macedo Mourelle [1509]: Systems Engineering using Particle Swarm Optimisation
Hill Climbing



10.1 Introduction 

Hill climbing¹ (HC) [1780] is a very old and simple search and optimization algorithm for
single objective functions f. In principle, hill climbing algorithms perform a loop in which
the currently known best solution individual p* is used to produce one offspring p_new. If this
new individual is better than its parent, it replaces it. Then, the cycle starts all over again. In
this sense, it is similar to an evolutionary algorithm with a population size ps of 1. Although
the search space G and the problem space X are most often the same in hill climbing, we
distinguish them in Algorithm 10.1 for the sake of generality. Hill climbing furthermore
normally uses a parameterless search operation to create the first solution candidate and,
from there on, unary operations to produce the offspring. Without loss of generality, we
will thus make use of the reproduction operations from evolutionary algorithms defined
in Section 2.5 on page 137, i.e., set Op = {create, mutate}.

The major problem of hill climbing is premature convergence, i.e., it easily gets stuck on a
local optimum. It always uses the best known solution candidate x* to find new points in the
problem space X. Hill climbing utilizes a unary reproduction operation similar to mutation
in evolutionary algorithms. It should be noted that hill climbing can be implemented in a
deterministic manner if the neighbor sets in the search space G, which here most often equals
the problem space X, are always finite and can be iterated over.

¹ http://en.wikipedia.org/wiki/Hill_climbing [accessed 2007-07-03]
Algorithm 10.1: x* ← hillClimber(f)

Input: f: the objective function subject to minimization
Data: p_new: the new element created
Data: p*: the (currently) best individual
Output: x*: the best element found

begin
  p*.g ← create()
  // Implicitly: p*.x ← gpm(p*.g)
  while ¬terminationCriterion() do
    p_new.g ← mutate(p*.g)
    // Implicitly: p_new.x ← gpm(p_new.g)
    if f(p_new.x) < f(p*.x) then p* ← p_new
  return p*.x
end
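
Algorithm 10.1 translates almost line by line into Python. The sketch below assumes that the search space and the problem space are identical (so the genotype-phenotype mapping is the identity) and that the termination criterion is a fixed step budget. Both are simplifications made for this example, and the function name is hypothetical.

```python
def hill_climber(f, create, mutate, steps=1000):
    """Minimize f: keep one best candidate and replace it only by strictly better offspring."""
    best = create()                 # nullary search operation: first solution candidate
    for _ in range(steps):          # stands in for terminationCriterion()
        candidate = mutate(best)    # unary search operation: one offspring
        if f(candidate) < f(best):
            best = candidate
    return best
```

With `f = lambda x: (x - 3) ** 2`, `create = lambda: 0` and the deterministic step `mutate = lambda x: x + 1`, the climber moves from 0 to 3 and then rejects every further offspring, because 4 is worse than 3.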



10.2 General Information 
10.2.1 Areas Of Application 

Some example areas of application of hill climbing are: 



Application                                              References

Networking and Communication                             [2268], see Section 23.2 on page 401
Data Mining and Data Analysis                            [646]
Evolving Behaviors, e.g., for Agents or Game Players     [2017]
Combinatorial Optimization                               [953, 347]



10.3 Multi-Objective Hill Climbing 

As illustrated in Algorithm 10.2 on the next page, we can easily extend hill climbing al- 
gorithms with a support for multi-objective optimization by using some of the methods 
of evolutionary algorithms. This extended approach will then return a set X* of the best 
solutions found instead of a single individual x* as done in Algorithm 10.1. The set of 
currently known best individuals Arc may contain more than one element. Therefore, we 
employ a selection scheme in order to determine which of these individuals should be used 
as parent for the next offspring in the multi-objective hill climbing algorithm. The selection 
algorithm applied must not solely rely on the prevalence comparison, since no element in 
Arc prevails any other. Thus, we also copy the idea of fitness assignment from evolutionary 
algorithms. For maintaining the optimal set, we apply the updating and pruning methods 
defined in Chapter 19 on page 307. 



Algorithm 10.2: X* ← hillClimberMO(cmp_F, as)

Input: cmp_F: the prevalence comparator
Input: as: the maximum archive size
Data: p_new: the new individual generated
Data: Arc: the set of best individuals known
Output: X*: the set of the best elements found

begin
  Arc ← ()
  p_new.g ← create()
  // Implicitly: p_new.x ← gpm(p_new.g)
  while ¬terminationCriterion() do
    Arc ← updateOptimalSet(Arc, p_new)
    Arc ← pruneOptimalSet(Arc, as)
    v ← assignFitness(Arc, cmp_F)
    p_new ← select(Arc, v, 1)[0]
    p_new.g ← mutate(p_new.g)
    // Implicitly: p_new.x ← gpm(p_new.g)
  return extractPhenotypes(Arc)
end



10.4 Problems in Hill Climbing 

Both versions of the algorithm are still very likely to get stuck on local optima. They will
only follow a path of solution candidates if it is monotonically² improving the objective
function(s). Hill climbing in this form is a local search rather than a global optimization
algorithm. By making a few slight modifications to the algorithm, however, it can become a
valuable global optimization technique:

1. A tabu-list which stores the elements recently evaluated can be added. By preventing
the algorithm from visiting them again, a better exploration of the problem space X can
be enforced. This technique is used in Tabu Search, which is discussed in Chapter 14 on
page 273.

2. Another way of preventing premature convergence is to not always transition to the
better solution candidate in each step. Simulated Annealing introduces a heuristic based
on the physical model of cooling down molten metal to decide whether an inferior
offspring should replace its parent or not. This approach is described in Chapter 12 on
page 263.

3. The Dynamic Hill Climbing approach by Yuret and de la Maza [2303] uses the last two
visited points to compute unit vectors. With this technique, the directions are adjusted
according to the structure of the problem space and a new coordinate frame is created
which is more likely to point into the right direction.

4. Randomly restarting the search after a certain number of steps is a crude but efficient method
to explore wide ranges of the problem space with hill climbing. You can find it outlined
in Section 10.5.

5. Using a reproduction scheme that does not necessarily generate solution candidates directly
neighboring x*, as done in Random Optimization, an optimization approach defined
in Chapter 11 on page 259, may prove even more efficient.

² http://en.wikipedia.org/wiki/Monotonicity [accessed 2007-07-03]
10.5 Hill Climbing with Random Restarts

Hill climbing with random restarts is also called Stochastic Hill Climbing (SH) or Stochastic
gradient descent³ [1923, 605]. We have mentioned it as a measure for preventing premature
convergence; here we want to take a closer look at this approach.

Let us further combine it directly with the multi-objective hill climbing approach defined
in Algorithm 10.2. The new algorithm incorporates two archives for optimal solutions: Arc_1,
the overall optimal set, and Arc_2, the set of the best individuals in the current run. We
additionally define the criterion shouldRestart(), which is evaluated in every iteration and
determines whether or not the algorithm should be restarted. shouldRestart()
could, for example, count the iterations performed or check if any improvement was produced
in the last ten iterations. After each single run, Arc_2 is incorporated into Arc_1, from which
we extract and return the problem space elements at the end of the hill climbing process,
as defined in Algorithm 10.3.



Algorithm 10.3: X* ← hillClimberMO_RR(cmpF, as)

Input: cmpF: the prevalence comparator
Input: as: the maximum archive size
Data: pnew: the new individual generated
Data: Arc1, Arc2: the sets of best individuals known
Output: X*: the set of the best elements found

begin
  Arc1 ← ()
  while ¬terminationCriterion() do
    Arc2 ← ()
    pnew.g ← create()
    // Implicitly: pnew.x ← gpm(pnew.g)
    while ¬(terminationCriterion() ∨ shouldRestart()) do
      Arc2 ← updateOptimalSet(Arc2, pnew)
      Arc2 ← pruneOptimalSet(Arc2, as)
      v ← assignFitness(Arc2, cmpF)
      pnew ← select(Arc2, v, 1)[0]
      pnew.g ← mutate(pnew.g)
      // Implicitly: pnew.x ← gpm(pnew.g)
    Arc1 ← updateOptimalSetN(Arc1, Arc2)
    Arc1 ← pruneOptimalSet(Arc1, as)
  return extractPhenotypes(Arc1)
end
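
The restart logic of Algorithm 10.3 can be sketched in a few lines of Python. This is a simplified, single-objective illustration and not the multi-objective algorithm itself: the two archives are reduced to a single best point each, and the names `hill_climb_with_restarts`, `max_evals`, and `patience` are assumptions introduced here. The restart criterion used is the one suggested in the text, a fixed number of iterations without improvement.

```python
import random


def hill_climb_with_restarts(f, create, mutate, max_evals=2000, patience=10, rng=None):
    """Minimize f; restart from a fresh point after `patience` non-improving steps."""
    rng = rng or random.Random()
    best_x, best_f = None, float("inf")      # plays the role of Arc1 (overall best)
    evals = 0
    while evals < max_evals:                 # outer loop: one pass per run
        cur_x = create(rng)                  # plays the role of Arc2 (best of this run)
        cur_f = f(cur_x)
        evals += 1
        stale = 0                            # steps since the last improvement
        # inner loop ends on termination or when shouldRestart() would fire
        while evals < max_evals and stale < patience:
            new_x = mutate(cur_x, rng)
            new_f = f(new_x)
            evals += 1
            if new_f < cur_f:
                cur_x, cur_f, stale = new_x, new_f, 0
            else:
                stale += 1
        if cur_f < best_f:                   # merge the result of this run into the overall best
            best_x, best_f = cur_x, cur_f
    return best_x, best_f
```

A call such as `hill_climb_with_restarts(lambda v: v * v, create=lambda rng: rng.uniform(-10, 10), mutate=lambda v, rng: v + rng.uniform(-0.1, 0.1))` would run several short climbs and return the best point seen in any of them.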



10.6 GRASP 

Greedy Randomized Adaptive Search Procedures (GRASPs) [663, 652, 1648, 1722] are meta- 
heuristics which repeatedly create new starting points and refine these with a local search 
algorithm until a termination criterion is met. In this, they are similar to hill climbing with 
random restarts. 

The initial construction phase of each iteration, however, may be much more complicated
than just randomly picking a new point in the search space. Feo and Resende [652] describe
it as an iterative construction process where one element [gene] is "added" at a time, where
the element to be added is chosen with respect to a greedy function. Here, not necessarily
the best possible allele is set, but one of the top candidates is picked randomly.

After the initial solution is generated this way, a local search is applied to refine it.
For this purpose, hill climbing, for instance, could be used, as well as a deterministic search
such as an IDDFS (see Section 17.3.4) or a greedy approach (see Section 17.4.1). Feo and
Resende [652] argue that the efficiency and quality of the solutions produced by such GRASP
processes are often much better than those of local searches started at random points.

3 http://en.wikipedia.org/wiki/Stochastic_gradient_descent [accessed 2007-07-03]
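
To make the two phases concrete, the following Python sketch applies the GRASP idea to a small travelling salesman instance. Everything specific in it is an assumption made for illustration: the problem, the greedy function (distance to the last city), the size of the restricted candidate list `rcl_size`, and the use of 2-opt as the local search. It is a minimal sketch and not the procedure of Feo and Resende.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def greedy_randomized_tour(pts, rcl_size, rng):
    """Construction phase: add one city at a time, chosen at random
    among the rcl_size cities nearest to the last one added."""
    unvisited = list(range(1, len(pts)))
    tour = [0]
    while unvisited:
        unvisited.sort(key=lambda c: math.dist(pts[tour[-1]], pts[c]))  # greedy function
        choice = rng.choice(unvisited[:rcl_size])                       # one of the top candidates
        unvisited.remove(choice)
        tour.append(choice)
    return tour


def local_search(tour, pts):
    """Refinement phase: 2-opt segment reversals until no reversal shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(cand, pts) < tour_length(tour, pts) - 1e-12:
                    tour, improved = cand, True
    return tour


def grasp(pts, iterations=20, rcl_size=3, seed=0):
    """Repeat construction and refinement, keeping the best tour found."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        tour = local_search(greedy_randomized_tour(pts, rcl_size, rng), pts)
        if best is None or tour_length(tour, pts) < tour_length(best, pts):
            best = tour
    return best
```

With `rcl_size=1` the construction becomes purely greedy, and with `rcl_size` equal to the number of cities it degenerates to a random start, which is the random-restart case discussed in Section 10.5.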



10.6.1 General Information

Areas Of Application

Some example areas of application of GRASP are:

Application                   References
Combinatorial Optimization    [651, 652, 1288]
Scheduling                    [650]

Online Resources

Some general, online available resources on GRASP are:

http://www.graspheuristic.org/ [accessed 2008-10-20]
Last update: 2004-02-29
Description: A website leading to a large annotated bibliography on GRASP.



10.7 Raindrop Method 

Only three years ago, Bettinger and Zhu [191, 2322] contributed a new search heuristic
for constrained optimization, the Raindrop Method, which they used for forest planning 
problems. In the original description of the algorithm, the search and problem space are 
identical (G = X). The algorithm is based on precise knowledge of the components of the 
solution candidates and on how their interaction influences the validity of the constraints. 
It works as follows: 

1. The Raindrop Method starts out with a single, valid solution candidate x* (i.e., one that 
violates none of the constraints). This candidate may be found with a random search 
process or may be provided by a human operator.

2. Create a copy x of x*. Set the iteration counter t to a user-defined maximum value T of 
iterations of modifying and correcting x that are allowed without improvements before 
reverting to x* . 

3. Perturb x by randomly modifying one of its components. Let us refer to this randomly 
selected component as s. This modification may lead to constraint violations. 

4. If no constraint was violated, continue at step 11, otherwise proceed as follows. 

5. Set a distance value d to 0. 

6. Create a list L of the components of x that lead to constraint violations. Here we make
use of the knowledge of the interaction of components and constraints.

7. From L, we pick the component c physically closest to s, that is, the component with
the minimum distance dist(c, s). In the original application of the Raindrop Method,
physically close was properly defined due to the fact that the solution candidates were
basically two-dimensional maps. For applications different from forest planning, appropriate
definitions for the distance measure have to be supplied.

8. Set d = dist(c, s).

9. Find the next best value for c which does not induce any new constraint violations in
components f which are at least as close to s, i.e., with dist(f, s) ≤ dist(c, s). This
change may, however, cause constraint violations in components farther away from s.
If no such change is possible, go to step 13. Otherwise, modify the component c in x.

10. Go back to step 4.

11. If x is better than x*, that is, x ≻ x*, set x* = x. Otherwise, decrease the iteration counter
t.

12. If the termination criterion has not yet been met, go back to step 3 if t > 0 and to step 2 if
t = 0.

13. Return x* to the user. 

The iteration counter t here is used to allow the search to explore solutions more distant
from the current optimum x*. The higher the initial value T specified by the user, the more
iterations without improvement are allowed before reverting x to x*. By the way, the name
Raindrop Method comes from the fact that the constraint violations caused by the perturbation
of the valid solution radiate away from the modified component s like waves on a
water surface radiate away from the point where a raindrop hits.

Random Optimization



11.1 Introduction 

The Random Optimization 1 method for single-objective, numerical problems, i.e., G = ℝ^n
and |F| = 1, was first proposed by Rastrigin [1709] in the early 1960s. It was studied
thoroughly by Gurin and Rastrigin [870], Schumer [1838] and further improved by Schumer 
and Steiglitz [1839] [2206]. A different Random Optimization approach has been introduced 
by Matyas [1371, 1372] around the same time. Matyas gave theorems about the convergence 
properties of his approach for unimodal optimization. Baba [91] then showed theoretically 
that the global optimum of an optimization problem can even be found if the objective 
function is multimodal. 

Random Optimization resembles hill climbing. There are, however, four important differences
between the two approaches:

1. In traditional hill climbing, the new solution candidates that are created from a good
individual are always very close neighbors of it. In Random Optimization, this is not
necessarily the case but only probable.

2. In Random Optimization, the unary search operation explicitly uses random numbers
whereas the unary search operations of hill climbing may be deterministic or randomized.

3. In Random Optimization, the search space G is always ℝ^n, the space of n-dimensional
real vectors.

4. In Random Optimization, we explicitly distinguish between objective functions f ∈ F
and constraints c ∈ C.

Random Optimization introduces a new search operation "roReproduce" specialized for
the numerical search space similar to mutation in Evolution Strategies. This operation is
constructed in a way that all points in the search space G = ℝ^n can be reached in one step
when starting out from every other point. In other words, the operator "roReproduce" is
complete in the sense of Definition 1.27.



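\[
\mathrm{roReproduce}(g) = g + r : \quad
r = \begin{pmatrix}
\mathrm{random}_n(\mu_1, \sigma_1^2) \\
\mathrm{random}_n(\mu_2, \sigma_2^2) \\
\vdots \\
\mathrm{random}_n(\mu_n, \sigma_n^2)
\end{pmatrix},
\qquad r[i] \sim N(\mu_i, \sigma_i^2)
\tag{11.1}
\]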



Equation 11.1 illustrates one approach to realize such a complete search operation. To
the genotype of the best solution candidate discovered so far, we add a vector r ∈ ℝ^n.
Each component r[i] of this vector is normally distributed around a value μ_i. Hence, the
probability density function underlying the components of this vector is greater than zero
for all real numbers. The μ_i are the expected values and the σ_i the standard deviations of the
normal distributions, as introduced in Section 28.5.2 on page 486. The μ_i define a general
direction for the search, i.e., if μ_i > 0, random_n(μ_i, σ_i²) will likely also be greater than zero,
and for μ_i < 0, it will probably be smaller than zero, too (given that |σ_i| ≪ |μ_i|).
The σ_i can be imagined as the range in which the random numbers are distributed around
the μ_i and denote a step width of the random numbers. If we choose the absolute values of
both, μ_i and σ_i, very small, we can exploit a local optimum whereas larger values lead to
a rougher exploration of the search space. If the μ_i are set to 0, the probability distribution of
the random numbers will be "centered" around the genotype it is applied to. Since the normal
distribution is generally unbounded, it is possible that the random elements in r can become
very large, even for small σ_i ≈ 0. Therefore, local optima can be left again even with bad
settings of μ and σ.

1 http://en.wikipedia.org/wiki/Random_optimization [accessed 2007-07-03]

Equation 11.1 is only one way to realize the completeness of the "roReproduce"
operation. Instead of the normal distribution, any other probability distribution with
f_X(y) > 0 ∀ y ∈ ℝ would do. Good properties can, for instance, be attributed to the
bell-shaped distribution used by Worakul et al. [2255, 2256] and discussed in Section 28.9.3 on
page 530 in this book.

In order to respect the idea of constrained optimization in Random Optimization as
introduced in [91], we define a set of constraint functions C. A constraint c ∈ C is satisfied by
a solution candidate x ∈ X if c(x) ≤ 0 holds.

Algorithm 11.1 illustrates how Random Optimization works, clearly showing connatural
traits in comparison with the hill climbing approach of Algorithm 10.1 on page 254.



Algorithm 11.1: x* ← randomOptimizer(f)

Input: f: the objective function subject to minimization
Data: pnew: the new element created
Data: p*: the (currently) best individual
Output: x*: the best element found

begin
  p*.g ← create()
  // Implicitly: p*.x ← gpm(p*.g)
  while ¬terminationCriterion() do
    pnew.g ← roReproduce(p*.g)
    // Implicitly: pnew.x ← gpm(pnew.g)
    if c(pnew.x) ≤ 0 ∀c ∈ C then
      if f(pnew.x) < f(p*.x) then p* ← pnew
  return p*.x
end
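
A minimal Python sketch of Algorithm 11.1 together with the normally distributed "roReproduce" operation of Equation 11.1 could look as follows. Several points are assumptions made here for illustration: the genotype-phenotype mapping is the identity, the start point is passed in and assumed to be feasible instead of being created randomly, the termination criterion is a fixed iteration count, and the function and parameter names are hypothetical. Note that Python's `random.gauss` expects the standard deviation σ, not the variance σ².

```python
import random


def ro_reproduce(g, mu, sigma, rng):
    """Return g + r, where each r[i] is drawn from N(mu[i], sigma[i]^2)."""
    return [gi + rng.gauss(m, s) for gi, m, s in zip(g, mu, sigma)]


def random_optimizer(f, constraints, start, mu, sigma, iterations=5000, seed=0):
    """Minimize f subject to c(x) <= 0 for every c in constraints."""
    rng = random.Random(seed)
    best = list(start)                 # assumed to satisfy all constraints
    best_f = f(best)
    for _ in range(iterations):        # stands in for terminationCriterion()
        cand = ro_reproduce(best, mu, sigma, rng)
        feasible = all(c(cand) <= 0 for c in constraints)
        if feasible and f(cand) < best_f:
            best, best_f = cand, f(cand)
    return best, best_f
```

For example, minimizing the sum of squares under the constraint x[0] ≥ 1 can be expressed with the single constraint function `lambda x: 1.0 - x[0]`.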



Setting the values of μ and σ adaptively can lead to large improvements in convergence
speed. The Heuristic Random Optimization (HRO) algorithm introduced by Li and Rhinehart
[1277] and its successor method Random Optimization II developed by Chandran and
Rhinehart [373], for example, update them by utilizing gradient information or reinforcement
learning.



11.2 General Information 
11.2.1 Areas Of Application 

Some example areas of application of (heuristic) Random Optimization are:

Application              References
Medicine                 [2255, 2256]
Biology and Medicine     [558]
Machine Learning         [1249]
Function Optimization    [1277, 1917]

Simulated Annealing



12.1 Introduction 

In 1953, Metropolis et al. [1396] developed a Monte Carlo method for "calculating the 
properties of any substance which may be considered as composed of interacting individual 
molecules" . With this so-called "Metropolis" procedure stemming from statistical mechan- 
ics, the manner in which metal crystals reconfigure and reach equilibria in the process of 
annealing can be simulated. This inspired Kirkpatrick et al. [1142] to develop the Simulated 
Annealing 1 (SA) algorithm for global optimization in the early 1980s and to apply it to var- 
ious combinatorial optimization problems. Independently, Cerny [363] employed a similar 
approach to the travelling salesman problem [1263, 78]. Simulated Annealing is an optimiza- 
tion method that can be applied to arbitrary search and problem spaces. Like simple hill 
climbing algorithms, Simulated Annealing only needs a single initial individual as starting 
point and a unary search operation. 

In metallurgy and material science, annealing 2 is a heat treatment of material with the 
goal of altering its properties such as hardness. Metal crystals have small defects, dislocations 
of ions which weaken the overall structure. By heating the metal, the energy of the ions 
and, thus, their diffusion rate is increased. Then, the dislocations can be destroyed and the 
structure of the crystal is reformed as the material cools down and approaches its equilibrium 
state. When annealing metal, the initial temperature must not be too low and the cooling 
must be done sufficiently slowly so as to avoid the system getting stuck in a meta-stable, 
non-crystalline, state representing a local minimum of energy. 

In physics, each set of positions of all atoms of a system pos is weighted by its
Boltzmann probability factor e^(−E(pos)/(k_B T)), where E(pos) is the energy of the configuration
pos, T is the temperature measured in Kelvin, and k_B is Boltzmann's constant 3,
k_B ≈ 1.380 65 · 10^−23 J/K.

The Metropolis procedure was an exact copy of this physical process which could be used
to simulate a collection of atoms in thermodynamic equilibrium at a given temperature. In
each iteration, a new nearby geometry pos_{i+1} was generated as a random displacement from
the current geometry pos_i of an atom. The energy of the resulting new geometry was
computed and ΔE, the energetic difference between the current and the new geometry,
was determined. The probability P(ΔE) that this new geometry is accepted is defined in
Equation 12.2.

\[ \Delta E = E(pos_{i+1}) - E(pos_i) \tag{12.1} \]

\[ P(\Delta E) = \begin{cases} e^{-\frac{\Delta E}{k_B T}} & \text{if } \Delta E > 0 \\ 1 & \text{otherwise} \end{cases} \tag{12.2} \]

1 http://en.wikipedia.org/wiki/Simulated_annealing [accessed 2007-07-03]
2 http://en.wikipedia.org/wiki/Annealing_(metallurgy) [accessed 2008-09-19]
3 http://en.wikipedia.org/wiki/Boltzmann%27s_constant [accessed 2007-07-03]

Thus, if the new nearby geometry has a lower energy level, the transition is accepted.
Otherwise, a uniformly distributed random number r = random_u() ∈ [0, 1) is drawn and the
step will only be accepted in the simulation if it is less than or equal to the Boltzmann
probability factor, i.e., r ≤ P(ΔE). At high temperatures T, this factor is very close to 1,
leading to the acceptance of many uphill steps. As the temperature falls, the proportion of
steps accepted which would increase the energy level decreases. Now the system will not
escape local regions anymore and (hopefully) comes to a rest in the global minimum at
temperature T = 0 K.
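
The acceptance rule of Equation 12.2 is small enough to be written down directly. The sketch below is an illustration under assumptions: the function names are made up here, k_B defaults to 1 (which only rescales the temperature), and T = 0 is treated as "never accept an uphill step" to avoid a division by zero. It uses the strict comparison r < P(ΔE) that also appears in Algorithm 12.1.

```python
import math
import random


def acceptance_probability(delta_e, temperature, k_b=1.0):
    """P(delta_e) from Equation 12.2."""
    if delta_e <= 0:
        return 1.0                      # downhill or equal: always accepted
    if temperature <= 0:
        return 0.0                      # frozen system: no uphill steps
    return math.exp(-delta_e / (k_b * temperature))


def accept(delta_e, temperature, rng, k_b=1.0):
    """Draw r uniformly from [0, 1) and accept the step if r < P(delta_e)."""
    return rng.random() < acceptance_probability(delta_e, temperature, k_b)
```

For ΔE = 1, the probability is about 0.999 at T = 1000 but practically zero at T = 0.01, which is the behavior described in the paragraph above.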

The abstraction of this method in order to allow arbitrary problem spaces is straightforward:
the energy computation E(pos_i) is replaced by an objective function f or even by
the result v of a fitness assignment process. Algorithm 12.1 illustrates the basic course of
Simulated Annealing. Without loss of generality, we reuse the definitions from evolutionary
algorithms for the search operations and set Op = {create, mutate}.



Algorithm 12.1: x* ← simulatedAnnealing(f)

Input: f: the objective function to be minimized
Data: pnew: the newly generated individual
Data: pcur: the point currently investigated in problem space
Data: p*: the best individual found so far
Data: T: the temperature of the system which is decreased over time
Data: t: the current time index
Data: ΔE: the energy difference of xnew and xcur
Output: x*: the best element found

begin
  pnew.g ← create()
  // Implicitly: pnew.x ← gpm(pnew.g)
  pcur ← pnew
  p* ← pnew
  t ← 0
  while ¬terminationCriterion() do
    ΔE ← f(pnew.x) − f(pcur.x)
    if ΔE ≤ 0 then
      pcur ← pnew
      if f(pcur.x) < f(p*.x) then p* ← pcur
    else
      T ← getTemperature(t)
      if random_u() < e^(−ΔE/(k_B T)) then pcur ← pnew
    pnew.g ← mutate(pcur.g)
    // Implicitly: pnew.x ← gpm(pnew.g)
    t ← t + 1
  return p*.x
end
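
Algorithm 12.1 translates almost line by line into Python. The following is a minimal sketch under assumptions: the genotype-phenotype mapping is the identity, the termination criterion is a fixed iteration count, k_B is folded into the temperature, and the parameter names (`create`, `mutate`, `get_temperature`) are hypothetical callables supplied by the user.

```python
import math
import random


def simulated_annealing(f, create, mutate, get_temperature, iterations=10000, seed=0):
    """Minimize f; returns the best point seen during the run."""
    rng = random.Random(seed)
    new = create(rng)
    cur = best = new
    for t in range(iterations):              # stands in for terminationCriterion()
        delta_e = f(new) - f(cur)
        if delta_e <= 0:
            cur = new                        # improvements are always accepted
            if f(cur) < f(best):
                best = cur
        else:
            temperature = get_temperature(t)
            # uphill step: accept with probability exp(-delta_e / T)
            if temperature > 0 and rng.random() < math.exp(-delta_e / temperature):
                cur = new
        new = mutate(cur, rng)
    return best
```

A typical call passes a cooling function such as `lambda t: 10.0 * 0.999 ** t`; how to choose such schedules is the topic of Section 12.3.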



It has been shown that Simulated Annealing algorithms with appropriate cooling strategies
will asymptotically converge to the global optimum. Nolte and Schrader [1540] and van
Laarhoven and Aarts [2095] provide lists of the most important works showing that Simulated
Annealing will converge to the global optimum if t → ∞ iterations are performed,
including the studies of Hajek [879]. Nolte and Schrader [1540] further list research providing
deterministic, non-infinite boundaries for the asymptotic convergence by Anily and
Federgruen [70], Gidas [804], Nolte and Schrader [1539], and Mitra et al. [1437]. In the
same paper, they introduce a significantly lower bound, which, however, still states that
Simulated Annealing is probably in an optimal configuration after the number of iterations
exceeds the cardinality of the problem space - which is, well, slow [1017]. In other words,
it would be faster to enumerate all possible solution candidates in order to find the global
optimum with absolute certainty than applying Simulated Annealing. This does not mean
that Simulated Annealing is always slow. It only needs that much time if we insist on
optimality. Speeding up the cooling process will result in a faster search, but voids the
guaranteed convergence on the other hand. Such speeded-up algorithms are called Simulated
Quenching (SQ) [1014, 1813, 



12.2 General Information 
12.2.1 Areas Of Application 

Some example areas of application of Simulated Annealing are: 



Application                                  References
Combinatorial Optimization                   [363, 1142, 298, 286, 473]
Function Optimization                        [818]
Chemistry, Chemical Engineering              [1292, 297, 1075, 1401]
Image Processing                             [1982, 2246, 2287, 298]
Economics and Finance                        [1015, 1016]
Electrical Engineering and Circuit Design    [1366, 1349, 298, 2070]
Machine Learning                             [1367, 298, 1368, 1853]
Geometry and Physics                         [1683]
Networking and Communication                 see Section 23.2 on page 401



For more information see also [2095]. 



12.2.2 Books 



Some books about (or including significant information about) Simulated Annealing are: 

van Laarhoven and Aarts [2095]: Simulated Annealing: Theory and Applications 
Tan [2001]: Simulated Annealing 

Badiru [113]: Handbook of Industrial and Systems Engineering 
Davis [494]: Genetic Algorithms and Simulated Annealing 



12.3 Temperature Scheduling 

The temperature schedule defines how the temperature in Simulated Annealing is decreased.
As already mentioned, this has major influence on whether the Simulated Annealing algorithm
will succeed, on how long it will take to find the global optimum, and on
whether or not it will degenerate to simulated quenching. For the later use in the Simulated
Annealing algorithm, let us define the new operator getTemperature(t) which computes the
temperature to be used at iteration t in the optimization process. For "getTemperature", a
few general rules hold. All schedules start with a temperature T_start which is greater than
zero. If the number of iterations t approaches infinity, the temperature must become 0 K.
This is a very weak statement, since we have shown that there exist finite boundaries after
which Simulated Annealing is most likely to have converged. So there will be a finite t_end in
all practical realizations after which the temperature drops to 0 K, as shown in Equation 12.6.



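\[ T \in \mathbb{R}^+,\; t \in \mathbb{N} \quad \forall\, T = \mathrm{getTemperature}(t) \tag{12.3} \]
\[ T_{start} = \mathrm{getTemperature}(0) > 0 \tag{12.4} \]
\[ \lim_{t \to \infty} \mathrm{getTemperature}(t) = 0\,\mathrm{K} \tag{12.5} \]
\[ \exists\, t_{end} \in \mathbb{N} : \mathrm{getTemperature}(t) = 0\,\mathrm{K} \quad \forall\, t > t_{end} \tag{12.6} \]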

There exists a wide range of methods to determine this temperature schedule. Miki et al. 
[1414], for example, used genetic algorithms for this purpose. We will introduce only the 
three simple variants here given by Press et al. [1675]. 

1. Reduce T to (1 − ε)T after every m iterations, where the exact values of 0 < ε < 1 and
m > 0 are determined by experiment.

2. Grant a total of K iterations, and reduce T after every m steps to a value T =
T_start (1 − t/K)^α, where t is the index of the current iteration and α is a constant, maybe
1, 2, or 4. α depends on the positions of the relative minima. Large values of α will
spend more iterations at lower temperature.

3. After every m moves, set T to β times ΔE_c = f(x_cur) − f(x*), where β is an experimentally
determined constant, f(x_cur) is the objective value of the currently examined
solution candidate x_cur, and f(x*) is the objective value of the best phenotype x* found
so far. Since ΔE_c may be 0, we limit the temperature change to a maximum of T·γ
with 0 < γ < 1.

If we let the temperature sink fast, we will lose the property of guaranteed convergence. 
In order to avoid getting stuck at local optima, we can then apply random restarting, which 
already has been discussed in the context of hill climbing in Section 10.5 on page 256. 
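
The three schedule variants listed above can be sketched as small Python functions. The function and parameter names are assumptions chosen here, and the third function reflects one possible reading of the rule: the temperature is set to β·(f(x_cur) − f(x*)) but is never lowered by more than the fraction γ of its current value in a single update.

```python
def geometric_schedule(t, t_start, epsilon, m):
    """Variant 1: multiply T by (1 - epsilon) after every m iterations."""
    return t_start * (1.0 - epsilon) ** (t // m)


def polynomial_schedule(t, t_start, k_total, alpha, m=1):
    """Variant 2: T = T_start * (1 - t/K)^alpha, updated every m steps, 0 from t = K on."""
    step = (t // m) * m
    return t_start * max(0.0, 1.0 - step / k_total) ** alpha


def adaptive_temperature(t_current, f_cur, f_best, beta, gamma):
    """Variant 3 (one interpretation): T = beta * (f_cur - f_best),
    but T drops by at most gamma * t_current per update."""
    return max(beta * (f_cur - f_best), (1.0 - gamma) * t_current)
```

The first two functions depend only on the iteration index and can be passed directly as getTemperature(t) once their parameters are fixed. The third needs the current temperature and objective values, so it would be called every m moves from inside the optimization loop.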



12.4 Multi-Objective Simulated Annealing 

Again, we want to combine this algorithm with multi-objective optimization and also enable
it to return a set of optimal solutions. This can be done even more simply than in multi-objective
hill climbing. Basically, we just need to replace the single objective function f with the fitness
values v computed by a fitness assignment process on the basis of the set of currently known best
solutions (Arc), the currently investigated individual (pcur), and the newly created points
in the search space (pnew).

Algorithm 12.2: X* ← simulatedAnnealingMO(cmpF, as)

Input: cmpF: the prevalence comparator
Input: as: the maximum number of individuals allowed to be stored in the archive
Data: pnew: the newly generated individual
Data: pcur: the point currently investigated in problem space
Data: Arc: the set of best individuals found so far
Data: T: the temperature of the system which is decreased over time
Data: t: the current time index
Data: ΔE: the energy difference of xnew and xcur
Output: X*: the set of the best elements found

begin
  pnew.g ← create()
  // Implicitly: pnew.x ← gpm(pnew.g)
  pcur ← pnew
  Arc ← createList(1, pcur)
  t ← 0
  while ¬terminationCriterion() do
    v ← assignFitness(Arc ∪ {pnew, pcur}, cmpF)
    ΔE ← v(pnew.x) − v(pcur.x)
    if ΔE ≤ 0 then
      pcur ← pnew
    else
      T ← getTemperature(t)
      if random_u() < e^(−ΔE/(k_B T)) then pcur ← pnew
    Arc ← updateOptimalSet(Arc, pnew)
    Arc ← pruneOptimalSet(Arc, as)
    pnew.g ← mutate(pcur.g)
    // Implicitly: pnew.x ← gpm(pnew.g)
    t ← t + 1
  return extractPhenotypes(Arc)
end

Extremal Optimization



13.1 Introduction 

13.1.1 Self-Organized Criticality

Different from Simulated Annealing, which is an optimization method based on the metaphor of
thermal equilibria from physics, the Extremal Optimization 1 (EO) algorithm of Boettcher
and Percus [237, 238, 239, 240, 236] is inspired by ideas of non-equilibrium physics. Especially
important in this context is the property of self-organized criticality 2 (SOC) [119, 1049]. The
theory of SOC states that large interactive systems evolve to a state where a change in one
single of their elements may lead to avalanches or domino effects that can reach any other
element in the system. The probability distribution of the number of elements n involved
in these avalanches is proportional to n^(−τ) (τ > 0). Hence, changes involving only a few
elements are most likely, but even avalanches involving the whole system are possible with
a non-zero probability. [526]

13.1.2 The Bak-Sneppen Model of Evolution

The Bak-Sneppen model of evolution [118] exhibits self-organized criticality and was the
inspiration for Extremal Optimization. Rather than focusing on single species, this model
considers a whole ecosystem and the co-evolution of many different species.

In the model, each species is represented only by a real fitness value between 0 and 1. In
each iteration, the species with the lowest fitness is mutated. The model does not include any 
representation for genomes, instead, mutation changes the fitness of the species directly by 
replacing it with a random value uniformly distributed in [0, 1]. In nature, this corresponds 
to the process where one species has developed further or was replaced by another one. 

So far, mutation (i. e., development) would become less likely the more the fitness 
increases. Fitness can also be viewed as a barrier: New characteristics must be at least as 
fit as the current ones to proliferate. In an ecosystem however, no species lives alone but 
depends on others, on its successors and predecessors in the food chain, for instance. Bak 
and Sneppen [118] consider this by arranging the species in a one dimensional line. If one 
species is mutated, the fitness values of its successor and predecessor in that line are also 
set to random values. In nature, the development of one species can foster the development 
of others and this way, even highly fit species may become able to (re-)adapt. 

After a certain number of iterations, the species in simulations based on this model reach
a highly-correlated state of self-organized criticality where all of them have a fitness above
a certain threshold. This state is very similar to the idea of punctuated equilibria from
evolutionary biology, and groups of species enter a state of passivity lasting multiple cycles.
Sooner or later, this state is interrupted because mutations occurring nearby undermine
their fitness. The resulting fluctuations may propagate like avalanches through the whole
ecosystem. Thus, such non-equilibrium systems exhibit a state of high adaptability without
limiting the scale of change towards better states [236].

1 http://en.wikipedia.org/wiki/Extremal_optimization [accessed 2008-08-24]
2 http://en.wikipedia.org/wiki/Self-organized_criticality [accessed 2008-08-23]
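
The model is simple enough to simulate in a few lines of Python. The sketch below follows the description above with one assumption added for convenience: the one-dimensional line of species is closed into a ring, so that the first and last species are neighbors. The function name and default parameters are likewise made up for this illustration.

```python
import random


def bak_sneppen(n_species=64, iterations=20000, seed=0):
    """Simulate the Bak-Sneppen model and return the final fitness values."""
    rng = random.Random(seed)
    fitness = [rng.random() for _ in range(n_species)]
    for _ in range(iterations):
        # the species with the lowest fitness is mutated ...
        weakest = min(range(n_species), key=fitness.__getitem__)
        # ... and so are its predecessor and successor (ring topology assumed)
        for j in (weakest - 1, weakest, (weakest + 1) % n_species):
            fitness[j] = rng.random()
    return fitness
```

After sufficiently many iterations, one would expect most of the returned values to lie above some threshold, with only the recently mutated species and their neighbors falling below it.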

13.2 Extremal Optimization and Generalized Extremal 
Optimization 

Boettcher and Percus [237] want to utilize this phenomenology to obtain near-optimal solutions
for optimization problems. In Extremal Optimization, the search spaces G are always
spaces of structured tuples g = (g[1], g[2], ..., g[n]). Extremal Optimization works on a single
individual and requires some means to determine the contributions of its genes to the overall
fitness.

Extremal Optimization was originally applied to a graph bi-partitioning problem [237],
where the n points of a graph had to be divided into two groups, each of size n/2. The objective
was to minimize the number of edges connecting the two groups. Search and problem space
can be considered as identical and a solution candidate x = gpm(g) = g consisted of n
genes, each of which stands for one point of the graph and denotes the Boolean decision
to which set it belongs. Analogously to the Bak-Sneppen model, each such gene g[i] had its
own fitness contribution λ(g[i]), the ratio of its outgoing edges connected to nodes from the
same set in relation to its total edge number. The higher this value, the better, but notice
that f(x = g) ≠ Σ_{i=1}^{n} λ(g[i]), since f(g) corresponds to the number of edges crossing the
cut. In general, the Extremal Optimization algorithm proceeds as follows:

1. Create an initial individual p with a random genotype p.g and set the currently best
known solution candidate x* to its phenotype: x* = p.x.

2. Sort all genes p.g[i] of p.g in a list in ascending order according to their fitness contribution
λ(p.g[i]).

3. Then, the gene p.g[i] with the lowest fitness contribution is selected from this list and
modified randomly, leading to a new individual p and a new solution candidate x =
p.x = gpm(p.g).

4. If p.x is better than x*, i.e., p.x ≻ x*, set x* = p.x.

5. If the termination criterion has not yet been met, continue at step 2. 

Instead of always picking the weakest part of g, Boettcher and Percus [238] selected the
gene(s) to be modified randomly in order to prevent the method from getting stuck in local
optima. In their work, the probability of a gene at list index j for being drawn is proportional
to j^(−τ). This variation was called τ-EO and showed superior performance compared to the
simple Extremal Optimization. In the graph partitioning problem on which Boettcher and
Percus [238] have worked, two genes from different sets needed to be drawn this way in each
step, since always two nodes had to be swapped in order to keep the size of the sub-graphs
constant. Values of τ in 1.3 ... 1.6 have been reported to produce good results [238].
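
The rank-based draw of τ-EO can be illustrated with a short Python sketch. It assumes that the fitness contributions of all genes are already available as a list, that list index j starts at 1 for the gene with the lowest contribution, and that `tau_eo_pick` is a name chosen here for illustration.

```python
import random


def tau_eo_pick(contributions, tau, rng):
    """Return the index of the gene to modify.

    Genes are ranked by ascending fitness contribution (rank 1 = weakest);
    the gene at rank j is drawn with probability proportional to j^(-tau).
    """
    order = sorted(range(len(contributions)), key=contributions.__getitem__)
    weights = [(rank + 1) ** -tau for rank in range(len(order))]
    return rng.choices(order, weights=weights, k=1)[0]
```

For very large τ this reduces to the simple Extremal Optimization rule of always picking the weakest gene, while τ = 0 gives a uniformly random choice.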

The major problem a user is confronted with in Extremal Optimization is how to determine
the fitness contributions λ(p.g[i]) of the elements p.g[i] of the genotypes p.g of the
solution candidates p.x. Boettcher and Percus [239] point out themselves that the "drawback
to EO is that a general definition of fitness for individual variables may prove ambiguous
or even impossible" [526]. de Sousa and Ramos [524, 525, 526] therefore propose an extension
to EO, called the Generalized Extremal Optimization (GEO), for fixed-length binary
genomes G = B^n. For each gene (bit) p.g[i] in the element p.g of the search space currently
examined, the following procedure is performed:

1. Create a copy g' of p.g.

2. Toggle bit i in g'.

3. Set λ(p.g[i]) to −f(gpm(g')) for maximization and to f(gpm(g')) in case of minimization. 3

By doing so, λ(p.g[i]) becomes a measure for how adapted the gene is. If f is subject to
maximization, high positive values of f(gpm(g')) (corresponding to low λ(p.g[i])) indicate
that gene i should be mutated and has a low fitness. For minimization, low values of f(gpm(g'))
indicate that mutating gene i would yield high improvements in the objective value.
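
The three steps of the GEO fitness assignment can be sketched as follows. The sketch assumes that genotypes are lists of 0/1 values and that the genotype-phenotype mapping is the identity, so the objective function is applied to the bit list directly; the function name `geo_contributions` is hypothetical.

```python
def geo_contributions(bits, f, maximize=False):
    """Return the fitness contribution of every bit as defined for GEO."""
    contributions = []
    for i in range(len(bits)):
        flipped = list(bits)              # step 1: create a copy g' of p.g
        flipped[i] = 1 - flipped[i]       # step 2: toggle bit i in g'
        value = f(flipped)                # objective value of the modified copy
        contributions.append(-value if maximize else value)  # step 3
    return contributions
```

For example, when minimizing the number of ones in `[1, 0, 1]`, the contributions are `[1, 3, 1]`: the two bits that are set receive the lowest values and are therefore the preferred candidates for mutation.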



13.3 General Information 
13.3.1 Areas Of Application 

Some example areas of application of Extremal Optimization are: 



Application                                          References
Combinatorial Optimization                           [1363, 237, 238, 236]
Engineering, Structural Optimization, and Design     [526, 762, 1753]
Networking and Communication                         [1363]; see Section 23.2 on page 401
Function Optimization                                [524]



3 In the original work of de Sousa et al. [526], f(x*) is subtracted from this value. Since we rank
the genes, this has basically no influence and is omitted here.

Tabu Search



14.1 Introduction 

Tabu Search 1 (TS) has been developed by Glover [810] in the mid 1980s [816]. Some of the 
basic ideas were introduced by Hansen [892] and further contributions in terms of formal- 
izing this method have been made by Glover [811, 812], and de Werra and Hertz [529] (as 
summarized by Hertz et al. [919] in their tutorial on Tabu Search) as well as by Battiti and 
Tecchiolli [158] and Cvijovic and Klinowski [471]. 

The word "tabu" 2 stems from Polynesia and describes a sacred place or object. Things 
that are tabu must be left alone and may not be visited or touched. Tabu Search extends hill 
climbing by this concept: it declares solution candidates which have already been visited as 
tabu. Hence, they must not be visited again, and the optimization process is less likely to get 
stuck at a local optimum. The simplest realization of this approach is to use a list tabu which 
stores all solution candidates that have already been tested. If a newly created phenotype 
can be found in this list, it is not investigated but rejected right away. Of course, the list 
cannot grow infinitely but has a finite maximum length n. If the (n + 1)th solution candidate 
is added, the oldest one must be removed. Alternatively, the list could be reduced by 
clustering. If some distance measure in the problem space 𝕏 is available, a certain perimeter 
around the listed solution candidates can be declared as tabu. More complex approaches 
store specific properties of the individuals instead of the phenotypes themselves in the list. 
This not only leads to more complicated algorithms but may also reject new solutions 
which are actually very good. Therefore, aspiration criteria can be defined which override 
the tabu list and allow certain individuals. 



1 http://en.wikipedia.org/wiki/Tabu_search [accessed 2007-07-03] 
2 http://en.wikipedia.org/wiki/Tapu_%28Polynesian_culture%29 [accessed 2008-03-27] 
Algorithm 14.1: x* ← tabuSearch(f, n) 

Input: f: the objective function subject to minimization 
Input: n: the maximum length of the tabu list (n > 0) 
Data: p_new: the new element created 
Data: p*: the (currently) best individual 
Data: tabu: the tabu list 
Output: x*: the best element found 

 1 begin 
 2   p*.g ← create() 
     // Implicitly: p*.x ← gpm(p*.g) 
 3   tabu ← createList(1, p*.x) 
 4   while ¬terminationCriterion() do 
 5     p_new.g ← mutate(p*.g) 
       // Implicitly: p_new.x ← gpm(p_new.g) 
 6     if searchItem_u(p_new.x, tabu) < 0 then 
 7       if f(p_new.x) < f(p*.x) then p* ← p_new 
 8       if len(tabu) ≥ n then tabu ← deleteListItem(tabu, 0) 
 9       tabu ← addListItem(tabu, p_new.x) 
10   return p*.x 
11 end 

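The steps of Algorithm 14.1 can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes that genotypes and phenotypes coincide (an identity genotype-phenotype mapping), that candidates can be compared with `==`, and that the termination criterion is a fixed step budget. The names `tabu_search`, `create`, `mutate`, and `max_steps` are illustrative choices.

```python
import random
from collections import deque


def tabu_search(f, create, mutate, n, max_steps=10_000, rng=None):
    """Minimize f with a FIFO tabu list holding at most n visited candidates."""
    rng = rng or random.Random()
    best = create(rng)
    # A deque with maxlen=n silently drops the oldest entry once it is full,
    # which mirrors deleteListItem(tabu, 0) followed by addListItem.
    tabu = deque([best], maxlen=n)
    for _ in range(max_steps):          # assumed termination criterion
        new = mutate(best, rng)
        if new in tabu:                 # already visited: reject right away
            continue
        if f(new) < f(best):
            best = new
        tabu.append(new)
    return best


# Toy problem (an assumption for illustration): minimize (x - 7)^2 over the integers.
def sphere(x):
    return (x - 7) ** 2


def random_start(rng):
    return rng.randint(-50, 50)


def step(x, rng):
    return x + rng.choice([-3, -2, -1, 1, 2, 3])


result = tabu_search(sphere, random_start, step, n=20, max_steps=2000,
                     rng=random.Random(1))
```

A bounded `deque` keeps the sketch short; a membership test on it costs O(n), so a real implementation with long tabu lists would probably pair it with a hash set.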

14.2 General Information 
14.2.1 Areas Of Application 

Some example areas of application of Tabu Search are: 



Application                      References 

Combinatorial Optimization       [336, 1829, 1010, 815, 814, 983, 2049, 1612, 47, 112, 285] 
Machine Learning                 [1855, 529] 
Biochemistry                     [1989] 
Operations Research              [674] 
Networking and Communication     [1641, 1643, 1642, 1683, 1953]; see Section 23.2 on page 401 



14.2.2 Books 



Some books about (or including significant information about) Tabu Search are: 



Pardalos and Du [1612]: Handbook of Combinatorial Optimization 
Badiru [113]: Handbook of Industrial and Systems Engineering 
Reeves [1716]: Modern Heuristic Techniques for Combinatorial Problems 
Jaziri [1045]: Local Search Techniques: Focus on Tabu Search 



14.3 Multi-Objective Tabu Search 

The simple Tabu Search is very similar to hill climbing and Simulated Annealing, as you 
can see when comparing it with Chapter 10 on page 253 and Chapter 12 on page 263. With 
Algorithm 14.2, we can thus define a multi-objective variant of Tabu Search in a manner 
very similar to multi-objective hill climbing or multi-objective Simulated Annealing. 
Algorithm 14.2: X* ← tabuSearchMO(cmp_F, n, as) 

Input: cmp_F: the prevalence comparator 
Input: n: the maximum length of the tabu list (n > 0) 
Input: as: the maximum archive size 
Data: tabu: the tabu list 
Data: p_new: the new individual generated 
Data: Arc: the set of best individuals known 
Output: X*: the set of the best elements found 

 1 begin 
 2   Arc ← () 
 3   p_new.g ← create() 
     // Implicitly: p_new.x ← gpm(p_new.g) 
 4   tabu ← () 
 5   while ¬terminationCriterion() do 
 6     if searchItem_u(p_new.x, tabu ∪ Arc) < 0 then 
 7       Arc ← updateOptimalSet(Arc, p_new) 
 8       Arc ← pruneOptimalSet(Arc, as) 
 9       v ← assignFitness(Arc, cmp_F) 
10       if len(tabu) ≥ n then tabu ← deleteListItem(tabu, 0) 
11       tabu ← addListItem(tabu, p_new.x) 
12     p_new ← select(Arc, v, 1)[0] 
13     p_new.g ← mutate(p_new.g) 
       // Implicitly: p_new.x ← gpm(p_new.g) 
14   return extractPhenotypes(Arc) 
15 end 
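The structure of the multi-objective variant can be sketched in Python as below. This is a simplified illustration under several assumptions, not the book's method: Pareto dominance stands in for the prevalence comparator, the archive is pruned by removing a random member, and the fitness-based selection step is replaced by a uniform random pick from the archive. Genotypes and phenotypes are again assumed to coincide, and all function names are hypothetical.

```python
import random
from collections import deque


def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def tabu_search_mo(objectives, create, mutate, n, archive_size,
                   max_steps=5000, rng=None):
    rng = rng or random.Random()
    archive = []                        # list of (candidate, objective vector)
    tabu = deque(maxlen=n)              # oldest entries fall out automatically
    new = create(rng)
    for _ in range(max_steps):          # assumed termination criterion
        known = new in tabu or any(new == x for x, _ in archive)
        if not known:
            fn = tuple(f(new) for f in objectives)
            if not any(dominates(fa, fn) for _, fa in archive):
                # Drop archive members that the newcomer dominates, then add it.
                archive = [(x, fa) for x, fa in archive if not dominates(fn, fa)]
                archive.append((new, fn))
            if len(archive) > archive_size:
                # Crude pruning (assumption): remove a random member.
                archive.pop(rng.randrange(len(archive)))
            tabu.append(new)
        parent = rng.choice(archive)[0]  # uniform pick instead of fitness-based select
        new = mutate(parent, rng)
    return [x for x, _ in archive]


# Toy problem (assumption): minimize x^2 and (x - 2)^2 over the integers.
front = tabu_search_mo(
    objectives=[lambda x: x * x, lambda x: (x - 2) ** 2],
    create=lambda rng: rng.randint(-10, 10),
    mutate=lambda x, rng: x + rng.choice([-1, 1]),
    n=5, archive_size=10, rng=random.Random(3),
)
```

For this toy problem the non-dominated integers are 0, 1, and 2, so the returned archive should settle on these three values.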
Memetic and Hybrid Algorithms 



Starting with the research contributed by Bosworth et al. [257] (1972), Bethke [190] (1980), 
and Brady [268] (1985), there is a long tradition of hybridizing evolutionary algorithms with 
other optimization methods such as hill climbing, Simulated Annealing, or Tabu Search 
[1893]. A comprehensive review on this topic has been provided by Grosan and Abraham 
[861, 862]. Such approaches are not limited to GAs as "basis": in Section 16.4, for 
example, we list a wide variety of approaches that combine the downhill simplex with 
population-based optimization methods ranging from genetic algorithms to Differential 
Evolution and Particle Swarm Optimization. Today, many of these approaches can 
be subsumed under the umbrella term Memetic Algorithms 1 (MAs). 

15.1 Memetic Algorithms 

The principle of genetic algorithms is to simulate natural evolution (where phenotypic 
features are encoded in genes) in order to solve optimization problems. The term Memetic 
Algorithm was coined by Moscato [1468, 1469] as an allegory for simulating a social evolution 
(where behavioral patterns are passed on in memes 2 ) for the same purpose. The concept 
of the meme has been defined by Dawkins [501] as a "unit of imitation in cultural transmission". 

Moscato [1468] uses the example of the Chinese martial art Kung-Fu, which has developed 
over many generations of masters teaching their students certain sequences of movements, 
the so-called forms. Each form is composed of a set of elementary aggressive and defensive 
patterns. These undecomposable sub-movements can be interpreted as memes. New memes 
are rarely introduced, and only a few of the masters of the art have the ability to do 
so. Being far from random, such modifications involve a lot of problem-specific knowledge 
and almost always result in improvements. Furthermore, only the best of the population 
of Kung-Fu practitioners can become masters and teach disciples. Kung-Fu fighters can 
determine their fitness by evaluating their performance or by competing with each other in 
tournaments. 

Based on this analogy, Moscato [1468] creates an example for the travelling salesman 
problem [1263, 78] involving the three principles of 

1. intelligent improvement based on local search with problem-specific operators, 

2. competition in form of a selection procedure, and 

3. cooperation in form of a problem-specific crossover operator. 

Further research work directly focusing on Memetic Algorithms has been contributed by 
Moscato et al. [1468, 1551, 952, 307, 220, 1470], Radcliffe and Surry [1693], Digalakis and 
Margaritis [565, 566], and Krasnogor and Smith [1215]. Other contributors of early work 

1 http://en.wikipedia.org/wiki/Memetic_algorithm [accessed 2007-07-03] 

2 http://en.wikipedia.org/wiki/Meme [accessed 2008-09-10] 
on genetic algorithm hybridization are Ackley [12] (1987), Goldberg [821] (1989), 
Gorges-Schleuter [835] (1989), Mühlenbein [1476, 1477, 1479] (1989), Brown et al. [294] (1989), 
and Davis [495] (1991). 

The definition of Memetic Algorithms given by Moscato [1468] is relatively general and 
encompasses many different approaches. Even though Memetic Algorithms are a metaphor 
based on social evolution, there also exist two theories in natural evolution which fit to 
the same idea of hybridizing evolutionary algorithms with other search methods [1598]. 
Lamarckism and the Baldwin effect are both concerned with phenotypic changes in living 
creatures and their influence on the fitness and adaptation of species. 

15.2 Lamarckian Evolution 

Lamarckian evolution 3 is a model of evolution accepted by science before the discovery of 
genetics. Superseding the earlier ideas of Erasmus Darwin [486] (the grandfather of Charles 
Darwin), de Lamarck [522] laid the foundations of the theory later known as Lamarckism 
with his book Philosophie Zoologique, published in 1809. Lamarckism has two basic principles: 

1. Individuals can attain new, beneficial characteristics during their lifetime and lose unused 
abilities. 

2. They pass on their traits (including those acquired during their lifetime) to their offspring. 

While the first concept is obviously correct, the second one contradicts the state of knowledge 
in modern biology. This does not diminish the merits of de Lamarck, who provided an early 
idea about how evolution could proceed. In his era, things like genes and DNA simply 
had not been discovered yet. Weismann [2189] was the first to argue that the hereditary 
information of higher organisms is separated from the somatic cells and, thus, could not be 
influenced by them [2067]. In nature, no phenotype-genotype mapping can take place. 

Lamarckian evolution can be "included" in evolutionary algorithms by performing a 
local search starting with each new individual resulting from applications of the reproduction 
operations. This search can be thought of as training or learning, and its results are coded 
back into the genotypes g ∈ 𝔾 [2215]. Therefore, this local optimization usually works 
directly in the search space 𝔾. Here, algorithms such as greedy search, hill climbing, Simulated 
Annealing, or Tabu Search can be utilized, but simply modifying the genotypes randomly 
and remembering the best results is also possible. 
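A Lamarckian generation as described above might look like the following Python sketch. It is only an illustration under stated assumptions: a short hill climb serves as the local search, reproduction is mutation only, survivors are chosen by truncation over parents and refined children, and all names (`hill_climb`, `lamarckian_generation`, `ls_steps`) are hypothetical.

```python
import random


def hill_climb(g, f, mutate, steps, rng):
    """Return the best genotype (and its objective value) found near g."""
    best, best_f = g, f(g)
    for _ in range(steps):
        cand = mutate(best, rng)
        cand_f = f(cand)
        if cand_f < best_f:
            best, best_f = cand, cand_f
    return best, best_f


def lamarckian_generation(population, f, mutate, rng, ls_steps=10):
    """Reproduce, refine every child by local search, and keep the refined genotype."""
    pool = [(f(g), g) for g in population]
    for parent in population:
        child = mutate(parent, rng)
        refined, refined_f = hill_climb(child, f, mutate, ls_steps, rng)
        pool.append((refined_f, refined))   # the learned result is coded back
    pool.sort(key=lambda entry: entry[0])   # minimization: smaller is better
    return [g for _, g in pool[:len(population)]]


# Toy setup (assumption): minimize the sphere function over 3-dimensional tuples.
def sphere(g):
    return sum(v * v for v in g)


def gaussian_mutation(g, rng):
    return tuple(v + rng.gauss(0.0, 0.3) for v in g)


rng = random.Random(0)
population = [tuple(rng.uniform(-5, 5) for _ in range(3)) for _ in range(6)]
initial_best = min(sphere(g) for g in population)
for _ in range(20):
    population = lamarckian_generation(population, sphere, gaussian_mutation, rng)
```

The key line is the one that appends the refined genotype rather than the original child: this is what makes the scheme Lamarckian.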

15.3 Baldwin Effect 

The Baldwin effect 4 [1883, 2129, 2130], first proposed by Baldwin [123, 124], Morgan [1451, 
1452], and Osborn [1586] in 1896, is an evolution theory which remains controversial to this 
day [511, 1646]. Suzuki and Arita [1985] describe it as a "possible scenario of interactions 
between evolution and learning caused by balances between benefit and cost of learning" 
[2163]. Learning is a rather local phenomenon, normally involving only single individuals, 
whereas evolution usually takes place on the global scale of a population. The Baldwin effect 
combines both in two steps [2067]: 

1. First, the lifetime learning gives the individuals the chance to adapt to their environment 
or even to change their phenotype. This phenotypic plasticity 5 may help the creatures to 
increase their fitness and, hence, their probability to produce more offspring. Different 
from Lamarckian evolution, the abilities attained this way do not influence the genotypes 
nor are inherited. 

3 http://en.wikipedia.org/wiki/Lamarckism [accessed 2008-09-10] 

4 http://en.wikipedia.org/wiki/Baldwin_effect [accessed 2008-09-10] 

5 http://en.wikipedia.org/wiki/Phenotypic_plasticity [accessed 2008-09-10] 
2. In the second phase, evolution step by step generates individuals which can learn these 
abilities faster and more easily and, finally, will have encoded them in their genome. 
Genotypical traits then replace the learning (or phenotypic adaptation) process and serve as 
an energy-saving shortcut to the beneficial traits. This process is called genetic 
assimilation 6 [2128, 2129, 2130, 2131]. 
Fig. 15.1.a: The influence of learning capabilities of individuals on the 
fitness landscape according to [929, 930]. 

Fig. 15.1.b: The positive and negative influence of learning capabilities 
of individuals on the fitness landscape as in [1985]. 

Figure 15.1: The Baldwin effect. 



Hinton and Nowlan [929, 930] were the first scientists to perform experiments on the 
Baldwin effect with genetic algorithms [169, 904]. They found that the evolutionary 
interaction with learning smooths the fitness landscape [864] and illustrated this effect 
with the example of a needle-in-a-haystack problem similar to Fig. 15.1.a. Mayley [1374] used 
experiments on Kauffman's NK fitness landscapes [1100] (see Section 21.2.1) to show that 
the Baldwin effect can also have negative influence: Whereas learning adds gradient infor- 
mation in regions of the search space which are distant from local or global optima (case 
1 in Fig. 15.1.b), it decreases the information in their near proximity (called hiding effect 
[1374, 1062], case 2 in Fig. 15.1.b). One interpretation of this issue is that learning capabil- 
ities help individuals to survive in adverse conditions since they may find good abilities by 
learning and phenotypic adaptation. On the other hand, it does not make much of a difference 

6 http://en.wikipedia.org/wiki/Genetic_assimilation [accessed 2008-09-10] 
whether an individual learns certain abilities or whether it was already born with them when 
it can exercise them at the same level of perfection. Thus, the selection pressure furthering 
the inclusion of good traits in the hereditary information decreases if a life form can learn or 
adapt its phenotypes. 

Suzuki and Arita [1985] found that the Baldwin effect decreases the evolution speed in 
their rugged experimental fitness landscape, but also led to significantly better results in the 
long term. By the way, Suzuki maintains a very nice bibliography on the Baldwin effect at 

http://www.alife.cs.is.nagoya-u.ac.jp/~reiji/baldwin/ [accessed 2008-09-10]. 

Like Lamarckian evolution, the Baldwin effect can also be added to evolutionary 
algorithms by performing a local search starting at each new offspring individual. Different from 
Lamarckism, the abilities and characteristics attained by this process only influence the 
objective values of an individual and are not coded back to the genotypes. Hence, it does not 
matter whether the search takes place in the search space 𝔾 or in the problem space 𝕏. The 
best objective values F(p'.x) found in a search around individual p become its own objective 
values, but the modified variant p' of p actually scoring them is discarded [2215]. 
Nevertheless, the implementer will store these individuals somewhere else if they were the best 
solution candidates ever found. She must furthermore ensure that the user will be provided 
with the correct objective values of the final set of solution candidates resulting from the 
optimization process (F(p.x), not F(p'.x)). 
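The Baldwinian bookkeeping described above can be sketched in Python as follows. This is an illustrative sketch under assumptions, not a reference implementation: a short hill climb plays the role of learning, selection is simple truncation to the better half, and the names `local_search`, `baldwinian_generation`, and `best_ever` are hypothetical.

```python
import random


def local_search(g, f, mutate, steps, rng):
    """Short hill climb; returns the learned variant p' and its objective value."""
    best, best_f = g, f(g)
    for _ in range(steps):
        cand = mutate(best, rng)
        cand_f = f(cand)
        if cand_f < best_f:
            best, best_f = cand, cand_f
    return best, best_f


def baldwinian_generation(population, f, mutate, rng, best_ever=None, ls_steps=10):
    """Select on learned objective values but keep the unmodified genotypes."""
    scored = []
    for parent in population:
        child = mutate(parent, rng)
        learned, learned_f = local_search(child, f, mutate, ls_steps, rng)
        scored.append((learned_f, child))        # p keeps its genotype, borrows F(p'.x)
        if best_ever is None or learned_f < best_ever[0]:
            best_ever = (learned_f, learned)     # p' is stored outside the population
    scored.sort(key=lambda entry: entry[0])
    half = max(1, len(scored) // 2)
    survivors = [g for _, g in scored[:half]]
    next_population = [survivors[i % half] for i in range(len(population))]
    return next_population, best_ever


# Toy setup (assumption): minimize the sphere function over 2-dimensional tuples.
def sphere(g):
    return sum(v * v for v in g)


def gaussian_mutation(g, rng):
    return tuple(v + rng.gauss(0.0, 0.3) for v in g)


rng = random.Random(0)
population = [tuple(rng.uniform(-5, 5) for _ in range(2)) for _ in range(6)]
best_ever, history = None, []
for _ in range(10):
    population, best_ever = baldwinian_generation(
        population, sphere, gaussian_mutation, rng, best_ever)
    history.append(best_ever[0])

# What the user should be shown at the end: the true values F(p.x), not F(p'.x).
final_report = [(g, sphere(g)) for g in population]
```

Compared with a Lamarckian variant, the only structural difference is which genotype enters the next population; the separate `best_ever` record is what prevents the discarded learned variants from being lost.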

15.4 Summary on Lamarckian and Baldwinian Evolution 

Whitley et al. [2215] showed that both Lamarckian and Baldwinian evolution can improve 
the performance of a genetic algorithm. In their experiments, the Lamarckian strategies were 
generally faster, but the Baldwin effect could provide better solutions in some cases. Sasaki 
and Tokoro [1808] furthermore showed that Lamarckian search is better if the environment 
(i.e., the objective functions) is static whereas Baldwinian evolution leads to better results in 
dynamic landscapes. This is only logical since in the Lamarckian case, the configurations with 
the best objective values are directly encoded in the genome and we have highly specialized 
genotypes. When applying the Baldwin effect, on the other hand, the genotypes can remain 
general and only the phenotypes are adapted. The work of Paenke et al. [1598] on the 
influence of phenotypic plasticity on the genotype diversity further substantiates the positive 
effects of the Baldwin effect in dynamic environments. 

15.5 General Information 
15.5.1 Areas Of Application 

Some example areas of application of Memetic Algorithms are: 



Application                                         References 

Combinatorial Optimization                          [220, 307, 1395, 901, 2043] 
Engineering, Structural Optimization, and Design    [901] 
Biochemistry                                        [901] 
Networking and Communication                        [2286, 1685]; see Section 23.2 on page 401 
Scheduling                                          [901] 
Operations Research                                 [886] 
Function Optimization                               [2042] 



15.5.2 Online Resources 

Some general resources on Memetic Algorithms available online are: 



http://www.densis.fee.unicamp.br/~moscato/memetic_home.html [accessed 2008-04-03] 
Last update: 2002-08-16 

Description: The Memetic Algorithms' Home Page by Pablo Moscato 



15.5.3 Books 

Some books about (or including significant information about) Memetic Algorithms are: 



Hart, Krasnogor, and Smith [901]: Recent Advances in Memetic Algorithms 

Corne, Dorigo, Glover, Dasgupta, Moscato, Poli, and Price [448]: New Ideas in Optimisation 

Glover and Kochenberger [813]: Handbook of Metaheuristics 

Grosan, Abraham, and Ishibuchi [862]: Hybrid Evolutionary Algorithms 
Downhill Simplex (Nelder and Mead) 



16.1 Introduction 

The downhill simplex 1 (or Nelder-Mead method or amoeba algorithm 2 ) published by Nelder 
and Mead [1517] in 1965 is a single-objective optimization approach for searching the space 
of n-dimensional real vectors (𝔾 ⊆ ℝⁿ) [1561, 1230]. Historically, it is closely related to the 
simplex extension by Spendley et al. [1941] to the Evolutionary Operation method mentioned 
in Section 2.1.6 on page 101 [1276]. Since it only uses the values of the objective functions 
without any derivative information (explicit or implicit), it falls into the general class of 
direct search methods [2260, 2054], as most of the optimization approaches discussed in this 
book do. 

Downhill simplex optimization uses n + 1 points in ℝⁿ. These points form a polytope 3 , 
a generalization of a polygon, in the n-dimensional space: a line segment in ℝ¹, a triangle 
in ℝ², a tetrahedron in ℝ³, and so on. Nondegenerated simplexes, i.e., those where the set of 
edges adjacent to any vertex forms a basis of ℝⁿ, have one important feature: The result 
of replacing a vertex with its reflection through the opposite face is again a nondegenerated 
simplex (see Fig. 16.1.a). The goal of downhill simplex optimization is to replace the best 
vertex of the simplex with an even better one or to ascertain that it is a candidate for 
the global optimum [1276]. Therefore, its other points are constantly flipped around in an 
intelligent manner, as we will outline in Section 16.3. 

Like hill climbing approaches, the downhill simplex may not converge to the global 
minimum and can get stuck at local optima [1230, 1383, 2046]. Random restarts (as in Hill 
Climbing with Random Restarts discussed in Section 10.5 on page 256) can be helpful here. 

16.2 General Information 
16.2.1 Areas Of Application 

Some example areas of application of downhill simplex are: 

1 http://en.wikipedia.org/wiki/Nelder-Mead_method [accessed 2008-06-14] 

2 In the book Numerical Recipes in C++ by Press et al. [1675], this optimization method is called 
"amoeba algorithm". 

3 http://en.wikipedia.org/wiki/Polytope [accessed 2008-06-14] 
Application                        References 

Chemistry, Chemical Engineering    [2142, 1401, 2127, 145] 
Robotics                           [371] 
Physics                            [1604, 1493] 
Biochemistry                       [2293] 
Data Mining and Data Analysis      [1812] 



16.2.2 Online Resources 

Some general resources on downhill simplex available online are: 



http://math.fullerton.edu/mathews/n2003/NelderMeadMod.html [accessed 2008-06-14] 
Last update: 2004-07-22 
Description: Nelder-Mead Search for a Minimum 

http://www.boomer.org/c/p3/c11/c1106.html [accessed 2008-06-14] 
Last update: 2003 
Description: Nelder-Mead (Simplex) Method 



16.2.3 Books 

Some books about (or including significant information about) downhill simplex are: 



Avriel [89]: Nonlinear Programming: Analysis and Methods 

Walters, Morgan, Parker, Jr., and Deming [2142]: Sequential Simplex Optimization: A Technique 
for Improving Quality and Productivity in Research, Development, and Manufacturing 

Press, Teukolsky, Vetterling, and Flannery [1675]: Numerical Recipes in C++. Example Book. 
The Art of Scientific Computing 



16.3 The Downhill Simplex Algorithm 

In Algorithm 16.1, we define the downhill simplex optimization approach. For simplification 
purposes, we set both the problem and the search space to the n-dimensional real vectors, 
i.e., 𝕏 ⊆ 𝔾 ⊆ ℝⁿ. In the actual implementation, we can use any set as problem space, 
given that a genotype-phenotype mapping gpm : ℝⁿ ↦ 𝕏 is provided. Furthermore, notice 
that we optimize only a single objective function f. We can easily extend this algorithm 
for multi-objective optimization by using a comparator function cmp_F based on a set of 
objective functions F instead of comparing the values of f. In Algorithm 10.2, we have 
created a multi-objective hill climbing method with the same approach. 

For visualization purposes, we apply the downhill simplex method exemplarily to an 
optimization problem with 𝔾 = 𝕏 = ℝ², where the simplex S consists of three points, in 
Figure 16.1. 

The optimization process described by Algorithm 16.1 starts with creating a sample of 
n+ 1 random points in the search space in line 2. Here, the createPop operation must ensure 
that these samples form a nondegenerated simplex. Notice that apart from the creation of 
the initial simplex, all further steps are deterministic and do not involve random numbers. 

In each search step, the points in the simplex S are arranged in ascending order according 
to their corresponding objective values (line 4). Hence, the best solution candidate is S[0] 
and the worst is S[n]. We then compute the center m of the n best points in line 5 and then 
Algorithm 16.1: x* ← downhillSimplex(f) 

Input: f: the objective function subject to minimization 
Input: [implicit] n: the dimension of the search space 
Input: [implicit] α, γ, ρ, σ: the reflection, the expansion, the contraction, and the shrink 
       coefficient 
Data: S: the simplex 
Data: m: the centroid of the n best points of the simplex 
Data: r: the reflection 
Data: e: the expansion 
Data: c: the contraction 
Data: b: a Boolean variable 
Data: i: a counter variable 
Output: x*: the best solution candidate found 

 1 begin 
 2   S ← createPop(n + 1) 
 3   while ¬terminationCriterion() do 
 4     S ← sortList_a(S, f) 
 5     m ← (1/n) Σ_{i=0..n−1} S[i] 
       // Reflection: reflect the worst point over m 
 6     r ← m + α(m − S[n]) 
 7     if f(S[0]) < f(r) < f(S[n]) then 
 8       S[n] ← r 
 9     else 
10       if f(r) < f(S[0]) then 
           // Expansion: try to search farther in this direction 
11         e ← r + γ(r − m) 
12         if f(e) < f(r) then S[n] ← e 
13         else S[n] ← r 
14       else 
15         b ← true 
16         if f(r) > f(S[n−1]) then 
             // Contraction: a test point between r and m 
17           c ← ρr + (1 − ρ)m 
18           if f(c) < f(r) then 
19             S[n] ← c 
20             b ← false 
21         if b then 
             // Shrink towards the best solution candidate S[0] 
22           for i ← n down to 1 do 
23             S[i] ← S[0] + σ(S[i] − S[0]) 
24   return S[0] 
25 end 



reflect the worst solution candidate S[n] through this point in line 6, obtaining the new point 
r as also illustrated in Fig. 16.1.a. The reflection parameter α is usually set to 1. 

In the case that r is somewhere in between the points of the current simplex, i.e., 
neither better than S[0] nor as bad as S[n], we directly replace S[n] with it. This simple 
move was already present in the first simplex algorithm defined by Spendley et al. [1941]. The 
contribution of Nelder and Mead [1517] was to turn the simplex search into an optimization 
algorithm by adding new options. These special operators were designed to speed up 
the optimization process by deforming the simplex in a way that they suggested would better 
adapt to the objective functions [1276]. 
Figure 16.1: One possible step of the downhill simplex algorithm applied to a problem in 
ℝ². 



If r is better than the best solution candidate S[0], one of these operators is to expand 
the simplex further into this promising direction (line 11). As sketched in Fig. 16.1.b, we 
obtain the point e with the expansion parameter γ set to 1. We now choose the better one 
of these two points in order to replace S[n]. 

If r is no better than S[n], the simplex is contracted by creating a point c somewhere 
in between r and m in line 17. In Fig. 16.1.c, the contraction parameter ρ was set to 1/2. 
We substitute S[n] with c only if c is better than r. 

When everything else fails, we shrink the whole simplex in line 23 by moving all points 
(except S[0]) into the direction of the current optimum S[0]. The shrinking parameter σ 
normally has the value 1/2, as is the case in the example outlined in Fig. 16.1.d. 
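The four moves described above (reflection, expansion, contraction, shrink) can be put together in a compact Python sketch that follows the structure of Algorithm 16.1. It makes some assumptions that are not part of the text: the caller supplies the initial nondegenerated simplex instead of a random createPop, the termination criterion is a fixed number of steps, and points are plain lists of floats. The parameter defaults mirror the values quoted in the text.

```python
def downhill_simplex(f, simplex, alpha=1.0, gamma=1.0, rho=0.5, sigma=0.5,
                     max_steps=200):
    """Minimize f starting from a simplex of n + 1 points in R^n."""
    S = [list(p) for p in simplex]
    n = len(S) - 1
    for _ in range(max_steps):                  # assumed termination criterion
        S.sort(key=f)                           # S[0] is the best, S[n] the worst
        m = [sum(p[j] for p in S[:n]) / n for j in range(n)]        # centroid of n best
        r = [m[j] + alpha * (m[j] - S[n][j]) for j in range(n)]     # reflection
        f_best, f_r = f(S[0]), f(r)
        if f_best < f_r < f(S[n]):
            S[n] = r
        elif f_r < f_best:
            e = [r[j] + gamma * (r[j] - m[j]) for j in range(n)]    # expansion
            S[n] = e if f(e) < f_r else r
        else:
            shrink = True
            if f_r > f(S[n - 1]):
                c = [rho * r[j] + (1 - rho) * m[j] for j in range(n)]  # contraction
                if f(c) < f_r:
                    S[n] = c
                    shrink = False
            if shrink:
                # Move every point except S[0] towards the current best point.
                S = [S[0]] + [[S[0][j] + sigma * (p[j] - S[0][j]) for j in range(n)]
                              for p in S[1:]]
    return min(S, key=f)


# Toy problem (assumption): a shifted sphere function in R^2 with minimum at (1, -2).
def shifted_sphere(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
best = downhill_simplex(shifted_sphere, start)
```

In this example the very first step already reflects the worst vertex (0, 1) to (1, -1) and expands it to (1.5, -2), which has the objective value 0.25. Since the best vertex is never replaced by a worse one, the returned point can be no worse than that.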

16.4 Hybridizing with the Downhill Simplex 

Interestingly, there are some similarities between evolutionary algorithms and the downhill 
simplex. Takahama and Sakai [1999], for instance, argue that the downhill simplex can be 
considered as an evolutionary algorithm with special selection and reproduction operators. 
Each search step of Nelder and Mead's algorithm could be regarded as an n-ary reproduction 
operation for search spaces that are subsets of ℝⁿ. Also, there are vague similarities between 
the search operations of Differential Evolution (described in Section 5.5 on page 229), Particle 
Swarm Optimization (introduced in Chapter 9 on page 249), and the reflection operator 
of downhill simplex. The joint work of Jakumeit et al. [1038] and Barth et al. [155], for 
example, goes into the direction of utilizing these similarities. 

The research of Wang and Qiu [2144, 2146, 2145, 2147, 1680] focuses on the Hybrid 
Simplex Method PSO (HSMPSO), which, as the name says, is a hybridization of Particle 
Swarm Optimization with Nelder and Mead's algorithm. In this approach, the downhill 
simplex operator is applied to each particle after a definite interval of iterations. Similar 
ideas of combining PSO with simplex methods are pursued by Fan et al. [643, 642]. 

Gao and Wang [770] emphasize the close similarities between the reproduction operators 
of Differential Evolution and the search step of the downhill simplex. Thus, it seems only 
logical to combine the two methods in form of a new Memetic Algorithm. The Low Dimen- 
sional Simplex Evolution (LDSE) of Luo and Yu [1333] incorporates the single search steps 
of the downhill simplex applied to a number m of points which is lower than the actual 
dimensionality of the problem n. Luo and Yu [1333, 1332] reported that this method is able 
to outperform Differential Evolution when applied to the test function set of Ali et al. [38]. 

There exist various other methods of hybridizing real-coded genetic algorithms with 
the downhill simplex algorithm. Renders and Bersini [1719], for example, divide the population 
into groups of n + 1 genotypes and allow the genetic algorithm to choose the Nelder-Mead 
simplex as an additional reproduction operation besides crossover and simple averaging. The 
concurrent simplex of Yen et al. [2293, 2294, 2292] uses a probabilistic simplex method with 
n + Ω points instead of n + 1, where the n best points are used to compute the centroid and 
the other Ω > 1 points are reflected over it. They apply this idea to the top S individuals 
in the population, obtain S − n children, and copy the best n individuals into the next 
generation. The remaining (ps − S) genotypes (where ps is the population size) are created 
according to the conventional genetic algorithm reproduction scheme. 

Barbosa et al. [145] also use the downhill simplex as reproduction operator in a real- 
coded genetic algorithm. Their operator performs up to a given number of simplex search 
steps (20 in their work) and leads to improved results. Again, this idea goes more into 
the direction of Memetic Algorithms. Further approaches for hybridizing genetic algorithms 
with the downhill simplex have been contributed by Musil et al. [1493], Zhang et al. [2314, 
2313], and Satapathy et al. [1812]. 

The simplex crossover operator (SPX 4 ) by Tsutsui et al. [2063, 925, 2061] also uses 
a simplex structure based on n + 1 real vectors for n-dimensional problem spaces. It is, 
however, not directly related to the downhill simplex search. 



4 This abbreviation is also used for single-point crossover, see Section 3.4.4. 
State Space Search 



17.1 Introduction 

State space search strategies 1 are not directly counted among the optimization algorithms. In 
global optimization, objective functions f ∈ F are optimized. Normally, all elements of the 
problem space 𝕏 are valid and differ only in their utility as solutions. Optimization is about 
finding the elements with the best utility. State space search is instead based on a criterion 
isGoal : 𝕏 ↦ 𝔹 which determines whether an element of the problem space is a valid solution 
or not. The purpose of the search process is to find elements x from the solution space 𝕊 ⊆ 𝕏, 
i.e., those for which isGoal(x) is true [1780, 446, 569]. 

Definition 17.1 (isGoal). The function isGoal : 𝕏 ↦ 𝔹 is the target predicate of state 
space search algorithms which states whether a given state x ∈ 𝕏 is a valid solution (by 
returning true), i.e., the goal state, or not (by returning false). isGoal thus corresponds 
to membership in the solution space 𝕊 defined in Definition 1.20 on page 42. 



We will be able to apply state space search strategies to global optimization if we 
can define a threshold y̆_i for each objective function f_i ∈ F. If we assume minimization, 
isGoal(x) becomes true if and only if the values of all objective functions for x drop below 
the given thresholds, and thus, we can define isGoal according to Equation 17.2. 
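The threshold-based goal test can be sketched in a few lines of Python. The sketch assumes minimization and treats a value equal to its threshold as acceptable (`<=`); whether the comparison should be strict is an assumption here. The helper name `make_is_goal` is hypothetical.

```python
def make_is_goal(objectives, thresholds):
    """Build an isGoal predicate from objective functions and one threshold each."""
    if len(objectives) != len(thresholds):
        raise ValueError("need exactly one threshold per objective function")

    def is_goal(x):
        # x is a goal state only if every objective value reaches its threshold.
        return all(f(x) <= y for f, y in zip(objectives, thresholds))

    return is_goal


# Toy example (assumption): two objectives over the integers.
is_goal = make_is_goal(
    objectives=[lambda x: x * x, lambda x: abs(x - 1)],
    thresholds=[4, 2],
)
```

With these toy thresholds, `is_goal(1)` holds because both 1 ≤ 4 and 0 ≤ 2, whereas `is_goal(3)` fails on the first objective (9 > 4).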



In state space search algorithms, the search space 𝔾 and the problem space 𝕏 are often 
identical. Most state space search algorithms can only be applied if the search space is 
enumerable. One feature of the search algorithms introduced here is that they all are 
deterministic. This means that they will yield the same results in each run (when applied to 
the same problem, that is). 

One additional operator needs to be defined for state space search algorithms: the 
"expand" function, which helps to enumerate the search space. 

Definition 17.2 (expand). The operator expand : 𝔾 ↦ 𝒫(𝔾) receives one element g from 
the search space 𝔾 as input and computes a set G ⊆ 𝔾 of elements which can be reached from 
it. 

"expand" is the exploration operation of state space search algorithms. Different from 
the mutation operator of evolutionary algorithms, it is strictly deterministic and returns a 
set instead of a single individual. Applying it to the same element g will thus always 

1 http://en.wikipedia.org/wiki/State_space_search [accessed 2007-08-06] 



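$$\text{isGoal}(x) = \text{true} \Leftrightarrow x \in \mathbb{S} \tag{17.1}$$

$$\text{isGoal}(x) = \begin{cases} \text{true} & \text{if } f_i(x) \leq \breve{y}_i \;\; \forall i \in \{1, \ldots, |F|\} \\ \text{false} & \text{otherwise} \end{cases} \tag{17.2}$$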
yield the same set G. We can consider expand(g) to return the set of all possible results that 
we could obtain with a unary search operation like mutation. 

The realization of expand has a severe impact on the performance of search algorithms. An 
efficient implementation, for example, should not include states that have already been 
visited in the returned set. If the same elements are returned, the same solution candidates and 
all of their children will be evaluated multiple times, which is useless and time consuming. 
Another problem occurs if there exist two elements g1, g2 ∈ 𝔾 with g1 ∈ expand(g2) 
and g2 ∈ expand(g1). Then, the search would get trapped in an endless loop. Thus, 
visiting a genotype twice should always be avoided. Often, it is possible to design the search 
operations in a way that prevents this from the start. Otherwise, tabu lists should be used, as 
done in the previously discussed Tabu Search algorithm (see Chapter 14 on page 273). 
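The following Python sketch illustrates how a record of visited states prevents both repeated evaluation and the endless loop between two mutually reachable elements. It is a minimal illustration, not one of the search strategies defined in this chapter: the frontier is processed first-in, first-out only for concreteness, states are assumed to be hashable, and the names `search_without_revisits` and `max_states` are hypothetical.

```python
from collections import deque


def search_without_revisits(start, expand, is_goal, max_states=10_000):
    """Explore states reachable from start, never expanding a state twice."""
    visited = {start}                   # unbounded "tabu list" of seen states
    frontier = deque([start])
    while frontier and len(visited) <= max_states:
        g = frontier.popleft()
        if is_goal(g):
            return g
        for child in expand(g):
            if child not in visited:    # skip states that were already seen
                visited.add(child)
                frontier.append(child)
    return None                         # no goal state found within the budget


# Toy search space (assumption): the integers, where g - 1 and g + 1 are reachable
# from g. Here g1 = 0 and g2 = 1 satisfy g1 in expand(g2) and g2 in expand(g1).
def neighbors(g):
    return [g - 1, g + 1]


found = search_without_revisits(0, neighbors, lambda g: g == 5)

# A finite cyclic space without any goal state: the search stops instead of looping.
nothing = search_without_revisits(0, lambda g: [(g + 1) % 4], lambda g: False)
```

Without the `visited` check, the first example would bounce between neighboring integers indefinitely, and the second would cycle through the four states forever.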

Since we want to keep our algorithm definitions as general as possible, we will keep the 
notation of individuals p that encompass a genotype p.g ∈ 𝔾 and a phenotype p.x ∈ 𝕏. 
Therefore, we need an expansion operator that returns a set of individuals P rather than 
a set G of elements of the search space. We therefore define the operation "expandToInd" 
in Algorithm 17.1. 



Algorithm 17.1: P ← expandToInd(g) 

Input: g: the element of the search space to be expanded 
Data: i: a counter variable 
Data: p: an individual record 
Output: P: the list of individual records resulting from the expansion 

1 begin 
2   G ← expand(g) 
3   P ← () 
4   for i ← 0 up to len(G) − 1 do 
5     p.g ← G[i] 
      // Implicitly: p.x ← gpm(p.g) 
6     P ← addListItem(P, p) 
7   return P 
8 end 
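Algorithm 17.1 translates almost directly into Python. In the sketch below, the `Individual` record and the explicit `gpm` argument are illustrative assumptions. The genotype-phenotype mapping that the pseudocode applies implicitly is made explicit here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Individual:
    g: Any   # genotype, an element of the search space
    x: Any   # phenotype, obtained via the genotype-phenotype mapping


def expand_to_ind(g: Any,
                  expand: Callable[[Any], Iterable[Any]],
                  gpm: Callable[[Any], Any]) -> List[Individual]:
    """Expand g and wrap every reachable genotype in an individual record."""
    return [Individual(g=child, x=gpm(child)) for child in expand(g)]


# Toy example (assumption): integer genotypes, phenotype = genotype times ten.
individuals = expand_to_ind(3, lambda g: [g - 1, g + 1], lambda g: g * 10)
```

For the toy operators above, expanding the genotype 3 gives two records, one for genotype 2 with phenotype 20 and one for genotype 4 with phenotype 40.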



For all state space search strategies, we can define four criteria that tell if they are 
suitable for a given problem or not. 

1. Completeness. Does the search algorithm guarantee to find a solution (given that there 
exists one)? (Do not confuse this with the completeness of search operations specified 
in Definition 1.27 on page 44.) 

2. Time Consumption. How much time will the strategy need to find a solution? 

3. Memory Consumption. How much memory will the algorithm need to store intermediate 
steps? Together with time consumption this property is closely related to complexity 
theory, as discussed in Section 30.1.3 on page 550. 

4. Optimality. Will the algorithm find an optimal solution if there exist multiple correct
solutions?

Search algorithms can further be classified according to the following definitions: 

Definition 17.3 (Local Search). Local search algorithms work on a single current state
(instead of multiple solution candidates) and generally move only to neighbors of the
current state [1780].

Local search algorithms are not systematic but have two major advantages: They use 
very little memory (normally only a constant amount) and are often able to find solutions 
in large or infinite search spaces. These advantages come, of course, with large trade-offs in
processing time. 

We can consider local searches as a special case of global searches, which incorporate larger
populations. If the previously mentioned requirements are met, global search, in turn, can
be regarded as a special case of global optimization algorithms.



17.2 General Information 




17.2.1 Areas Of Application 




Some example areas of application of state space search are:

Application                      References
Networking and Communication     [1918, 1637, 1952]; see Section 23.2 on page 401



17.2.2 Books 



Some books about (or including significant information about) State space search are: 

Russell and Norvig [1780]: Artificial Intelligence: A Modern Approach 
Bednorz [166]: Advances in Greedy Algorithms



17.3 Uninformed Search 

The optimization algorithms that we have considered up to now always require some sort 
of utility measure. These measures, the objective functions, are normally real-valued and 
allow us to make fine distinctions between different individuals. Under some circumstances, 
however, only a criterion isGoal is given as a form of Boolean objective function. The methods 
previously discussed will then not be able to descend a gradient anymore and degenerate to 
random walks (see Section 17.3.5 on page 294). 

Here, uninformed search strategies 2 are a viable alternative since they do not require 
or take into consideration any knowledge about the special nature of the problem (apart 
from the knowledge represented by the expand operation, of course). Such algorithms are 
very general and can be applied to a wide variety of problems. Their common drawback is
that search spaces are often very large. Without the incorporation of information in the form
of heuristic functions, for example, the search may take very long and quickly become
infeasible [1780, 446, 569].

17.3.1 Breadth-First Search 

In breadth-first search 3 (BFS), we start by expanding the root solution candidate. Then
all of the states derived from this expansion are visited, and then all their children, and so
on. In general, we first expand all states at depth d before considering any state at depth
d + 1.

It is complete, since it will always find a solution if there exists one. If so, it will also find
the solution that can be reached from the root state with the fewest expansion steps. Hence,
if the number of expansion steps needed from the origin to a state is a measure for the costs,
BFS is also optimal.

Algorithm 17.2: X* ← bfs(r, isGoal)

Input: r: the root individual to start the expansion at
Input: isGoal: an operator that checks whether a state is a goal state or not
Data: p: the state currently processed
Data: P: the queue of states to explore
Output: X*: the solution states found, or ∅

1 begin
2   P ← createList(1, r)
3   while len(P) > 0 do
4     p ← deleteListItem(P, 0)
5     if isGoal(p.x) then return {p.x}
6     P ← appendList(P, expandToInd(p.g))
7   return ∅
8 end

Algorithm 17.2 illustrates how breadth-first search works. The algorithm is initialized
with a root state $r \in \mathbb{G}$ which marks the starting point of the search. BFS uses a
state list P which initially only contains this root individual. In a loop, the first element p
is removed from this list. If the goal predicate isGoal(p.x) evaluates to true, p.x is a goal
state and we can return a set X* = {p.x} containing it as the solution. Otherwise, we expand
p.g and append the newly found individuals to the end of the queue P. If no solution can be
found, this process will continue until the whole accessible search space has been enumerated
and P becomes empty. Then, the empty set ∅ is returned in place of X*, because there is no
element x in the (accessible part of the) problem space $\mathbb{X}$ for which isGoal(x)
becomes true.
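The loop described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not part of Algorithm 17.2: genotype and phenotype are identical, the depth of each state is tracked only to make the result easier to inspect, a `visited` set suppresses duplicate states, and `toy_expand` is a hypothetical search space.

```python
from collections import deque


def bfs(root, is_goal, expand):
    """Breadth-first search.

    Returns (goal_state, depth) for the first goal found, or None if the
    accessible search space is exhausted without finding one.
    """
    queue = deque([(root, 0)])   # FIFO queue of (state, depth) pairs
    visited = {root}             # sketch-level duplicate check
    while queue:
        state, depth = queue.popleft()   # remove the first element
        if is_goal(state):
            return state, depth
        for child in expand(state):
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))   # append at the end
    return None


# Hypothetical toy problem: reach a number from 1 using "+1" and "*2".
def toy_expand(n):
    return [n + 1, 2 * n] if n < 50 else []


result = bfs(1, lambda n: n == 10, toy_expand)       # goal reachable
unreachable = bfs(1, lambda n: n == -1, toy_expand)  # space gets exhausted
```

Because states are processed level by level, the depth reported for the goal is the smallest number of expansion steps that leads to it.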

In order to examine the space and time complexity of BFS, we assume a hypothetical state
space $\mathbb{G}_h$ where the expansion of each state $g \in \mathbb{G}_h$ will return a set of
len(expand(g)) = b new states. In depth 0, we only have one state, the root state r. In depth 1,
there are b states, and in depth 2 we can expand each of them to, again, b new states, which
makes $b^2$, and so on. Up to depth d, the total number of states is

$$1 + b + b^2 + \dots + b^d = \frac{b^{d+1} - 1}{b - 1} \in \mathcal{O}(b^d) \qquad (17.3)$$

Both the space and the time complexity are therefore in $\mathcal{O}(b^d)$. In the worst case,
all nodes in depth d need to be stored, in the best case only those of depth d − 1.
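The state count can be checked numerically with a few lines of Python. The function names are chosen for this sketch only, and the closed form assumes a branching factor b > 1.

```python
def count_states(b, d):
    """Number of states in a uniform tree with branching factor b up to depth d."""
    return sum(b ** level for level in range(d + 1))


def closed_form(b, d):
    """Closed form of the geometric sum (requires b > 1)."""
    return (b ** (d + 1) - 1) // (b - 1)


# Example: b = 10 successors per state, explored up to depth d = 3.
total = count_states(10, 3)
```

For b = 10 and d = 3 this gives 1 + 10 + 100 + 1000 states, and the deepest level alone already accounts for about 90% of them, which is why the total is in the order of $b^d$.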

17.3.2 Depth-First Search 

Depth-first search 4 (DFS) is very similar to BFS. From the algorithmic point of view, the
only difference is that it uses a stack instead of a queue as internal storage for states (compare
line 4 in Algorithm 17.3 with line 4 in Algorithm 17.2). Here, always the last element
of the list of expanded states is considered next. Thus, instead of searching level by level in
breadth as BFS does, DFS searches in depth (which - believe it or not - is the reason for
its name). DFS advances in depth until the current state cannot be expanded further, i. e.,
expand(p.g) = (). Then, the search steps up one level again. If the whole search space has
been browsed and no solution is found, ∅ is returned.

The memory consumption of DFS is linear, because in depth d, at most d * b states
are held in memory. If we assume a maximum depth m, the time complexity is $b^m$ in the
worst case, where the solution is the last child state in the path explored last. If m is
very large or infinite, a DFS may take very long to discover a solution or will not find it at
all, since it may get stuck in a "wrong" branch of the state space. Hence, depth-first search
is neither complete nor optimal.

Algorithm 17.3: X* ← dfs(r, isGoal)

Input: r: the root individual to start the expansion at
Input: isGoal: an operator that checks whether a state is a goal state or not
Data: p: the state currently processed
Data: P: the stack of states to explore
Output: X*: the solution states found, or ∅

1 begin
2   P ← createList(1, r)
3   while len(P) > 0 do
4     p ← deleteListItem(P, len(P) − 1)
5     if isGoal(p.x) then return {p.x}
6     P ← appendList(P, expandToInd(p.g))
7   return ∅
8 end

2 http://en.wikipedia.org/wiki/Uninformed_search [accessed 2007-08-07]

3 http://en.wikipedia.org/wiki/Breadth-first_search [accessed 2007-08-06]

4 http://en.wikipedia.org/wiki/Depth-first_search [accessed 2007-08-06]
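A stack-based depth-first search along the lines of Algorithm 17.3 might look as follows in Python. The optional `trace` list and the finite toy tree `toy_expand` are assumptions added for illustration; genotype and phenotype are again taken to be identical.

```python
def dfs(root, is_goal, expand, trace=None):
    """Depth-first search; returns a goal state or None if none is found."""
    stack = [root]                # a Python list used as a LIFO stack
    while stack:
        state = stack.pop()       # always take the most recently added state
        if trace is not None:
            trace.append(state)   # record the visiting order (sketch only)
        if is_goal(state):
            return state
        stack.extend(expand(state))
    return None


# Hypothetical finite search space: a binary tree with the nodes 1..15.
def toy_expand(n):
    return [2 * n, 2 * n + 1] if n < 8 else []


visited_order = []
found = dfs(1, lambda n: n == 12, toy_expand, trace=visited_order)
missing = dfs(1, lambda n: n == 99, toy_expand)
```

The recorded order shows how the search first runs down one branch to a leaf before it steps back up and tries the next one.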

17.3.3 Depth-limited Search 

The depth-limited search [1780] is a depth-first search that only proceeds up to a given
maximum depth d. In other words, it does not examine solution candidates that are more
than d expand-operations away from the root state r, as outlined in Algorithm 17.4 in a
recursive form. Analogously to the plain depth-first search, the time complexity now becomes
$b^d$ and the memory complexity is in $\mathcal{O}(b * d)$. Of course, the depth-limited
search can neither be complete nor optimal. If, however, a maximum depth of the possible
solutions is known, it may be sufficient.



Algorithm 17.4: X* ← dl_dfs(r, isGoal, d)

Input: r: the root individual to start the expansion at
Input: isGoal: an operator that checks whether a state is a goal state or not
Input: d: the (remaining) allowed depth steps
Data: p: the state currently processed
Output: X*: the solution states found, or ∅

begin
  if isGoal(r.x) then return {r.x}
  if d > 0 then
    foreach p ∈ expandToInd(r.g) do
      X* ← dl_dfs(p, isGoal, d − 1)
      if len(X*) > 0 then return X*
  return ∅
end
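The recursive form of Algorithm 17.4 translates almost directly into Python. In this sketch, `None` plays the role of the empty result set, and the infinite binary tree `toy_expand` is a hypothetical example space.

```python
def dl_dfs(state, is_goal, expand, depth):
    """Depth-limited DFS: explores at most `depth` expansion steps below `state`.

    Returns a goal state, or None if no goal lies within the limit.
    """
    if is_goal(state):
        return state
    if depth > 0:
        for child in expand(state):
            found = dl_dfs(child, is_goal, expand, depth - 1)
            if found is not None:
                return found
    return None


# Hypothetical infinite search space: every integer n has children 2n and 2n+1.
def toy_expand(n):
    return [2 * n, 2 * n + 1]


too_shallow = dl_dfs(1, lambda n: n == 5, toy_expand, 1)  # 5 lies at depth 2
deep_enough = dl_dfs(1, lambda n: n == 5, toy_expand, 2)
```

Although the toy space is infinite, both calls terminate, since the recursion never descends below the given limit.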



5 http://en.wikipedia.org/wiki/Depth-limited_search [accessed 2007-08-07] 
17.3.4 Iterative Deepening Depth-First Search

The iterative deepening depth-first search 6 (IDDFS, [1780]), defined in Algorithm 17.5,
iteratively runs a depth-limited DFS with stepwise increasing maximum depths d. In each
iteration, it visits the states according to the depth-first search. Since the maximum
depth is always incremented by one, one new level in terms of distance in expand-operations
from the root is explored in each iteration. This effectively leads to some form of
breadth-first search.

IDDFS thus unites the advantages of BFS and DFS: It is complete and optimal, but only
has a linearly rising memory consumption in $\mathcal{O}(d * b)$. The time consumption, of
course, is still in $\mathcal{O}(b^d)$. IDDFS is therefore usually the preferred uninformed
search strategy for large search spaces in which the depth of the solution is unknown.

Algorithm 17.5 is intended for infinitely large search spaces. In real systems, there
is a maximum d after which the whole space would be explored and the algorithm should
return ∅ if no solution was found.



Algorithm 17.5: X* ← iddfs(r, isGoal)

Input: r: the root individual to start the expansion at
Input: isGoal: an operator that checks whether a state is a goal state or not
Data: d: the current depth limit
Output: X*: the solution states found, or ∅

begin
  d ← 0
  repeat
    X* ← dl_dfs(r, isGoal, d)
    d ← d + 1
  until len(X*) > 0
  return X*
end
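A Python sketch of iterative deepening is given below. Unlike Algorithm 17.5, it takes an explicit `max_depth` bound (an assumption of this sketch, in the spirit of the remark on real systems) so that it also terminates if no solution exists. The helper `dl_dfs` and the toy space are repeated to keep the example self-contained.

```python
def dl_dfs(state, is_goal, expand, depth):
    """Depth-limited DFS used as the inner search."""
    if is_goal(state):
        return state
    if depth > 0:
        for child in expand(state):
            found = dl_dfs(child, is_goal, expand, depth - 1)
            if found is not None:
                return found
    return None


def iddfs(root, is_goal, expand, max_depth):
    """Run depth-limited searches with limits 0, 1, ..., max_depth.

    Returns (goal_state, depth_limit_at_which_it_was_found) or None.
    """
    for limit in range(max_depth + 1):
        found = dl_dfs(root, is_goal, expand, limit)
        if found is not None:
            return found, limit
    return None


# Hypothetical infinite search space: every integer n has children 2n and 2n+1.
def toy_expand(n):
    return [2 * n, 2 * n + 1]


hit = iddfs(1, lambda n: n == 11, toy_expand, max_depth=10)
miss = iddfs(1, lambda n: n == 11, toy_expand, max_depth=2)
```

The limit at which the goal is first found equals its distance from the root, which is the breadth-first behavior described above, while each single iteration only needs the memory of a depth-first search.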



17.3.5 Random Walks 



Random walks (sometimes also called drunkard's walks) are a special case of undirected,
local search. Instead of proceeding according to some schema like depth-first or breadth-first,
the next solution candidate to be explored is always generated randomly from the currently
investigated one [974, 649]. Under some special circumstances, random walks can be the
search algorithms of choice. This is, for instance, the case in the following situations:

1. We encounter a state explosion because there are too many states to which we could
possibly move, and methods like breadth-first search or iterative deepening depth-first
search cannot be applied because they would consume too much memory.

2. In certain cases of online search, it is not possible to apply systematic approaches like
BFS or DFS. If the environment, for instance, is only partially observable and each state
transition represents an immediate interaction with this environment, we may not be
able to navigate to past states again. One example for such a case is discussed in the
work of Skubch [1897, 1898] about reasoning agents.

6 http://en.wikipedia.org/wiki/IDDFS [accessed 2007-08-08] 

7 http://en.wikipedia.org/wiki/Random_walk [accessed 2007-11-27] 
Random walks are often used in optimization theory for determining features of a fitness
landscape. Measures that can be derived mathematically from a walk model include
estimates for the number of steps needed until a certain configuration is found, the chance to
find a certain configuration at all, and the average difference of the objective values of two
consecutive populations. From practically running random walks, some information about
the search space can be extracted. Skubch [1897, 1898], for instance, uses the number of
encounters of certain state transitions during the walk in order to successively build a
heuristic function.
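How such a measure can be obtained from an actual walk is sketched below. The bit-string search space, the one-bit-flip neighborhood, and the "leading ones" objective are hypothetical choices made only for this illustration.

```python
import random


def flip_neighbors(bits):
    """All bit strings that differ from `bits` in exactly one position."""
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]


def random_walk(start, neighbors, steps, rng):
    """Return the path of a walk that always moves to a random neighbor."""
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(neighbors(path[-1])))
    return path


def leading_ones(bits):
    """Toy objective: number of 1-bits before the first 0-bit."""
    count = 0
    for bit in bits:
        if bit == 0:
            break
        count += 1
    return count


rng = random.Random(42)                 # fixed seed for reproducibility
path = random_walk((0,) * 8, flip_neighbors, 100, rng)
diffs = [abs(leading_ones(a) - leading_ones(b)) for a, b in zip(path, path[1:])]
mean_step = sum(diffs) / len(diffs)     # average objective change per step
```

The value `mean_step` is a crude empirical estimate of how strongly the objective changes between consecutive states, which is one of the landscape features mentioned above.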



17.4 Informed Search 

In an informed search 8 , a heuristic function h helps to decide which states are to be expanded 
next. If the heuristic is good, informed search algorithms may dramatically outperform 
uninformed strategies [1407, 1711, 1626]. 

As specified in Definition 1.2 on page 22, heuristic functions are problem domain dependent.
In the context of an informed search, a heuristic function $h : \mathbb{X} \mapsto \mathbb{R}^+$
maps the states in the state space $\mathbb{G}$ to the positive real numbers $\mathbb{R}^+$. We
further define that all heuristics will be zero for the elements which are part of the solution
space $\mathbb{S}$, i. e.,

$$\forall x \in \mathbb{X} : \mathrm{isGoal}(x) \Rightarrow h(x) = 0 \quad \forall \text{ heuristics } h : \mathbb{X} \mapsto \mathbb{R}^+ \qquad (17.4)$$

There are two possible meanings of the values returned by a heuristic function h: 

1. In the above sense, the value of a heuristic function h(p.x) for an individual p is the
higher, the more expand-steps p.g is probably (or approximately) away from a
valid solution. Hence, the heuristic function represents the distance of an individual to
a solution in solution space.

2. The heuristic function can also represent an objective function in some way. Suppose that
we know the minimal value y for an objective function f or at least a value from where
on all solutions are feasible. If this is the case, we could set h(p.x) = max {0, f(p.x) − y},
assuming that f is subject to minimization. Now the value of the heuristic function will be
the smaller, the closer an individual is to a possible correct solution, and Equation 17.4
still holds. In other words, a heuristic function may also represent the distance to a
solution in objective space.
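The second interpretation can be written down in a few lines of Python. The helper `make_heuristic` and the quadratic objective are hypothetical and only serve to show that the resulting h is zero for optimal solutions.

```python
def make_heuristic(f, y_min):
    """Build h(x) = max(0, f(x) - y_min) for an objective f to be minimized."""
    def h(x):
        return max(0.0, f(x) - y_min)
    return h


# Hypothetical objective with known minimal value 1, reached at x = 3.
def objective(x):
    return (x - 3) ** 2 + 1


h = make_heuristic(objective, y_min=1)
```

Here h(3) is zero, in line with Equation 17.4, and h grows with the distance to the optimum in objective space.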

Of course, both meanings are often closely related since states that are close to each other 
in problem space are probably also close to each other in objective space (the opposite does 
not necessarily hold). 

A best-first search 9 [1626] is a search algorithm that incorporates such a heuristic function
h in a way which ensures that promising individuals p with low estimation values h(p.x) are
evaluated before other states q that receive higher values h(q.x) > h(p.x).

17.4.1 Greedy Search 



A greedy search 10 is a best-first search where the currently known solution candidate with
the lowest heuristic value is investigated next. The greedy algorithm internally sorts the list
of currently known states in descending order according to a heuristic function h. Thus, the
elements with the best (lowest) heuristic value will be at the end of the list, which then
can be used as a stack. The greedy search as specified in Algorithm 17.6 now works like a
depth-first search on this stack and thus also shares most of the properties of DFS. It
is neither complete nor optimal and its worst-case time consumption is $b^m$. On the other
hand, like breadth-first search, its worst-case memory consumption is also $b^m$.

8 http://en.wikipedia.org/wiki/Search_algorithms#Informed_search [accessed 2007-08-08]

9 http://en.wikipedia.org/wiki/Best-first_search [accessed 2007-09-25]

10 http://en.wikipedia.org/wiki/Greedy_search [accessed 2007-08-08]



Algorithm 17.6: X* ← greedySearch(r, isGoal, h)

Input: r: the root individual to start the expansion at
Input: isGoal: an operator that checks whether a state is a goal state or not
Input: h: the heuristic function
Data: p: the state currently processed
Data: P: the queue of states to explore
Output: X*: the solution states found, or ∅

begin
  P ← createList(1, r)
  while len(P) > 0 do
    P ← sortList_d(P, cmp(p1, p2) = h(p1.x) − h(p2.x))
    p ← deleteListItem(P, len(P) − 1)
    if isGoal(p.x) then return {p.x}
    P ← appendList(P, expandToInd(p.g))
  return ∅
end
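A possible Python rendering of the greedy search is sketched below. Note two deliberate deviations from Algorithm 17.6: instead of re-sorting the whole list in every iteration, the sketch keeps the known states in a binary heap (Python's `heapq`), which yields the same "lowest h first" order, and it uses a `visited` set to avoid re-expanding states. The toy problem and its heuristic are hypothetical.

```python
import heapq
import itertools


def greedy_search(root, is_goal, expand, h):
    """Greedy best-first search; returns a goal state or None."""
    counter = itertools.count()   # tie-breaker, so states are never compared
    frontier = [(h(root), next(counter), root)]
    visited = {root}
    while frontier:
        _, _, state = heapq.heappop(frontier)   # state with the lowest h value
        if is_goal(state):
            return state
        for child in expand(state):
            if child not in visited:
                visited.add(child)
                heapq.heappush(frontier, (h(child), next(counter), child))
    return None


# Hypothetical toy problem on the integers 0..100 with moves -1, +1 and *2.
def toy_expand(n):
    return [m for m in (n - 1, n + 1, 2 * n) if 0 <= m <= 100]


target = 20
found = greedy_search(1, lambda n: n == target, toy_expand,
                      lambda n: abs(target - n))
not_found = greedy_search(1, lambda n: n == 200, toy_expand,
                          lambda n: abs(200 - n))
```

The second call exhausts the finite toy space and returns `None`. In an infinite space without the `visited` set, the same strategy could follow a misleading heuristic forever, which is why greedy search is not complete in general.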



17.4.2 A* search 

An A* search 11 is a best-first search that uses an estimation function
$h^* : \mathbb{X} \mapsto \mathbb{R}^+$ which is the sum of a heuristic function h(x) that
estimates the costs needed to get from x to a valid solution and a function g(x) that computes
the costs that are needed to get to x.

$$h^*(x) = g(x) + h(x) \qquad (17.5)$$

A* search proceeds exactly like the greedy search outlined in Algorithm 17.6 if h* is
used instead of plain h. An A* search will definitely find a solution if there exists one, i. e.,
it is complete.

Definition 17.4 (Admissible Heuristic Function). A heuristic function
$h : \mathbb{X} \mapsto \mathbb{R}^+$ is admissible if it never overestimates the minimal costs
for reaching a goal state.

Definition 17.5 (Monotonic Heuristic Function). A heuristic function
$h : \mathbb{X} \mapsto \mathbb{R}^+$ is monotonic 12 if it never overestimates the costs for
getting from one state to its successor.

$$h(p.x) \le g(q.x) - g(p.x) + h(q.x) \quad \forall q.g \in \mathrm{expand}(p.g) \qquad (17.6)$$

An A* search is optimal if the heuristic function h used is admissible. Optimal in this 
case means that there exists no search algorithm that can find the same solution as the A* 
search needing fewer expansion steps if using the same heuristic. If we implement expand in 
a way which prevents that a state is visited more than once, h also needs to be monotone 
in order for the search to be optimal. 
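The following Python sketch shows an A* search that orders states by g(x) + h(x). It is an illustration under assumptions: states are nodes of a small hypothetical weighted graph, `successors` yields (state, step cost) pairs, and a binary heap replaces the sorted list of Algorithm 17.6. The heuristic values were chosen by hand to be both admissible and monotonic for this graph.

```python
import heapq


def a_star(start, is_goal, successors, h):
    """A* search; returns (cost, path) of the solution found, or None."""
    frontier = [(h(start), 0, start, [start])]   # entries: (g + h, g, state, path)
    best_g = {start: 0}                          # cheapest known cost per state
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return g, path
        if g > best_g.get(state, float("inf")):
            continue                             # outdated queue entry
        for nxt, cost in successors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier,
                               (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None


# Hypothetical weighted graph and heuristic estimates of the cost to reach 'D'.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}
estimate = {"A": 3, "B": 2, "C": 1, "D": 0}

solution = a_star("A", lambda s: s == "D",
                  lambda s: graph[s].items(), estimate.get)
```

The direct edge from B to D costs 5, so the cheapest route runs over C with a total cost of 4. With the g term included, the search prefers this route even though B looks closer to the goal in terms of expansion steps.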

11 http://en.wikipedia.org/wiki/A%2A_search [accessed 2007-08-09]

12 see Definition 27.28 on page 463 






17.4.3 Adaptive Walks 

An adaptive walk is a theoretical optimization method which, like a random walk, usually 
works on a population of size 1. It starts at a random location in the search space and 
proceeds by changing (or mutating) its single solution candidate. For this modification, 
three methods are available: 

1. One-mutant change: The optimization process chooses a single new individual from the
set of "one-mutant change" neighbors, i. e., a neighboring individual differing from the
current solution candidate in only one property. If the new individual is better, it replaces
its ancestor, otherwise it is discarded.

2. Greedy dynamics: The optimization process chooses a single new individual from the set
of "one-mutant change" neighbors. If it is not better than the current solution candidate,
the search continues until a better one has been found or all neighbors have been
enumerated. The major difference to the previous form is the number of steps that are
needed per improvement.

3. Fitter dynamics: The optimization process enumerates all one-mutant neighbors of the
current solution candidate and moves to the best one.
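The three update rules can be contrasted in a short Python sketch. Bit strings with a one-bit-flip neighborhood and a fitness that is maximized are assumptions made for this illustration. The sketch further assumes that the fitter dynamics stay at the current candidate if no neighbor is better.

```python
import random


def neighbors(bits):
    """All one-mutant neighbors: bit strings differing in exactly one bit."""
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]


def one_mutant_step(current, fitness, rng):
    """Try one random neighbor; keep it only if it is better."""
    candidate = rng.choice(neighbors(current))
    return candidate if fitness(candidate) > fitness(current) else current


def greedy_dynamics_step(current, fitness, rng):
    """Try neighbors in random order until the first better one is found."""
    candidates = neighbors(current)
    rng.shuffle(candidates)
    for candidate in candidates:
        if fitness(candidate) > fitness(current):
            return candidate
    return current


def fitter_dynamics_step(current, fitness):
    """Enumerate all neighbors and move to the best one if it is better."""
    best = max(neighbors(current), key=fitness)
    return best if fitness(best) > fitness(current) else current


rng = random.Random(0)
start, top = (0, 0, 0, 0), (1, 1, 1, 1)
after_fitter = fitter_dynamics_step(start, sum)   # fitness = number of ones
```

With the number of ones as fitness, every rule improves the all-zero string by one bit per step, and none of them moves away from the all-one string.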

From these elaborations, it becomes clear that adaptive walks are very similar to hill 
climbing and Random Optimization. The major difference is that an adaptive walk is a 
theoretical construct that, very much like random walks, helps us to determine properties of 
fitness landscapes whereas the other two are practical realizations of optimization algorithms. 

Adaptive walks are a very common construct in evolutionary biology. Biological popula- 
tions are running for a very long time and so their genetic compositions are assumed to be rel- 
atively converged [807, 44]. The dynamics of such populations in near-equilibrium states with 
low mutation rates can be approximated with one-mutant adaptive walks [1903, 807, 44]. 
Parallelization and Distribution



As already stated many times, global optimization problems are often computationally
intense. Up until now, we have only explored the structure and functionality of optimization
algorithms without paying attention to their potential for parallelization or even
distribution 1 .

Roughly speaking, parallelization 2 means searching for pieces of code that can potentially
run concurrently and letting different processors execute them [1984, 184]. Take painting
a fence, for example. Here, the overall progress will be much faster if more than one painter
applies the color to the wood. Distribution 3 is a special case of parallelization where the
different processors are located on different machines in a network [146, 2010]. Imagine that
each fence-painter would take a piece of the fence to his workshop where he can use a special
airbrush which can color the whole piece at once. Distribution comes with the trade-off of
additional communication costs for transporting the data, but has the benefit that it is more
generic. At the current time, off-the-shelf PCs usually have no more than two CPUs. This
limits the benefit of local parallelization. We can, however, connect arbitrarily many such
computers in a network for distributed processing.



18.1 Analysis 

In order to understand which parts of an optimization algorithm can be parallelized, the
first step is an analysis. We will do such an analysis for evolutionary algorithms as an example
for population-based optimizers 4 . The parallelization and distribution of evolutionary
algorithms has long been a subject of study and has been discussed by multiple researchers like
Alba and Tomassini [34], Cantú-Paz [329, 330], Tan et al. [2003], Tanese [2007], Mühlenbein
[1478], and Bollini and Piastra [244].

There are two components of evolutionary algorithms whose performance can potentially
be increased remarkably by parallelization: the evaluation and the reproduction stages. As
sketched in Figure 18.1, evaluation is a per-individual process. The values of the objective 
functions are determined for each solution candidate independently from the rest of the pop- 
ulation. Evaluating the individuals often involves complicated simulations and calculations 
and is thus usually the most time-consuming part of evolutionary algorithms. 

During the fitness assignment process, it is normally required to compare solution can- 
didates with the rest of the population, to compute special sets of individuals, or to update 
some data structures. This makes it very hard for parallelization to provide any speedup. 

1 Section 30.2 on page 553 gives a detailed introduction into distributed algorithms, their advan- 
tages and drawbacks. 

2 http://en.wikipedia.org/wiki/Parallelization [accessed 2007-07-03]

3 http://en.wikipedia.org/wiki/Distributed_computing [accessed 2007-11-30] 

4 In Section 2.1.3 on page 98 you can find the basic evolutionary algorithm. 
The selection phase may or may not require access to certain subsets of the population
or data updates. Whether parallelization is possible or beneficial thus depends on the
selection scheme applied.

The reproduction phase, on the other hand, can very easily be parallelized. It involves
creating a new individual by using (but not altering) the information from n existing ones,
where n = 0 corresponds to the creation operation, n = 1 resembles mutation, and n = 2
means recombination. Thus, the making of each new genotype is an independent task.




[Figure 18.1 shows the stages Evaluation, Fitness Assignment, and Reproduction.]

Figure 18.1: Parallelization potential in evolutionary algorithms.



Instead of running an evolutionary algorithm in a single thread 5 of execution (see
Figure 18.2), our analysis has shown that it makes sense to have at least the evaluation and
reproduction phases executed in parallel, as illustrated in Figure 18.3. Usually, the population
is larger than the number of available CPUs 6 , so one thread could be created per processor
that consecutively pulls individuals out of a queue and processes them. This approach even
yields performance gains on off-the-shelf personal computers since these nowadays at least
come with hyper-threading 7 technology [2064, 1564] or even dual-core 8 CPUs [1165, 1891].
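A minimal sketch of such a parallel evaluation phase in Python is shown below, using the standard library's `ThreadPoolExecutor`, whose worker threads pull tasks from an internal queue. The function name `evaluate_population`, the worker count, and the toy objective are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, objective, workers=2):
    """Evaluate every individual independently using a pool of worker threads.

    The objective values are returned in the same order as the individuals.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(objective, population))


# Hypothetical objective function; real ones are typically far more expensive.
def sphere(x):
    return sum(v * v for v in x)


population = [(1.0, 2.0), (0.0, 0.5), (3.0, 0.0)]
values = evaluate_population(population, sphere)
```

In CPython, threads mainly pay off if the objective function spends its time in I/O or native code. For pure-Python, CPU-bound objective functions, a process pool such as `concurrent.futures.ProcessPoolExecutor` would be the more likely choice, while the overall structure stays the same.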



[Figure 18.2 shows the cycle Initial., Eval., Fitness., Select., Reprod. within a single thread on the local machine.]

Figure 18.2: A sequentially proceeding evolutionary algorithm.



5 http://en.wikipedia.org/wiki/Thread_%28computer_science%29 [accessed 2007-07-03]

6 http://en.wikipedia.org/wiki/Cpu [accessed 2007-07-03] 

7 http://en.wikipedia.org/wiki/Hyper-threading [accessed 2007-07-03] 

8 http://en.wikipedia.org/wiki/Dual-core [accessed 2007-07-03] 
Figure 18.3: A parallel evolutionary algorithm with two worker threads.



Cantú-Paz [328] divides parallel evolutionary algorithms into two main classes:

1. In globally parallelized EAs, each individual in the population can (possibly) always 
mate with any other. 

2. In coarse grained approaches, the population is divided into several sub-populations 
where mating inside a sub-population is unrestricted but mating between individuals of 
different sub-populations may only take place occasionally according to some rule. 

In ancient Greece, a deme was a district or township inhabited by a group that formed 
an independent community. They were the basic units of government in Attica as remodeled 
by Cleisthenes around 500 BC. In biology, a deme is a locally interbreeding group within a 
geographic population. 

Definition 18.1 (Deme). In evolutionary algorithms, a deme is a distinct sub-population. 

In the following, we are going to discuss some of the different parallelization methods 
from the viewpoint of distribution because of its greater generality. 



18.2 Distribution 

The distribution of an algorithm only pays off if the delay induced by the transmissions
necessary for data exchange is much smaller than the time saved by distributing the
computational load. Thus, in some cases, distributing an optimization process is useless. If
searching for the root of a mathematical function, for example, transmitting the parameter
vector x to another computer will take much longer than computing the function f(x) locally.
In this section, we will investigate some basic means to distribute evolutionary algorithms
that can as well be applied to other optimization methods, as outlined by Weise and Geihs [2177].

18.2.1 Client-Server 

If the evaluation of the objective functions is time consuming, the easiest approach to
distribute an evolutionary algorithm is the client-server scheme (also called master-slave). 9
Figure 18.4 illustrates how we can make use of this very basic, global distribution scheme.
Here, the servers (slaves) receive the single tasks, process them, and return the results.
Such a task can, for example, be the reproduction of one or two individuals and the
subsequent determination of the objective values of the offspring. The client (or master) just
needs to distribute the parent individuals to the servers and receives their fully evaluated
offspring in return. These offspring are then integrated into the population, where fitness
assignment and selection are performed. Client-server-based distribution approaches for
evolutionary algorithms have been discussed and tested by Van Veldhuizen et al. [2102], Xu
et al. [2271], and Dubreuil et al. [604] and were realized in general-purpose software packages
by Cahon et al. [323], Luke et al. [1327], and Weise and Geihs [2177].

9 A general discussion concerning the client-server architecture can be found in Section 30.2.2 on
page 556
Figure 18.4: An EA distributed according to the client-server approach.



One practical realization of this approach can be to use a queue into which all the selected
individuals are pushed as a mating pool. Each server in the network is then represented
by a thread on the client side. Such a thread pulls individuals from the queue, sends them
to its corresponding server, and waits for the result to be returned. It places the individuals
it receives into the new population and then starts over again. Servers may possess multiple
processors, which can be taken into account by representing them by an appropriate number
of threads.
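The queue-based realization can be sketched in Python as follows. The remote servers are simulated by ordinary local functions, and all names (`run_generation`, `proxy`, `toy_server`) are assumptions of this sketch. A real system would replace the call `server(parent)` by a network request.

```python
import queue
import threading


def run_generation(mating_pool, servers):
    """Distribute the parents in the mating pool to the given servers.

    Each server is represented by one client-side thread that pulls parents
    from a shared queue until the queue is empty.
    """
    tasks = queue.Queue()
    for parent in mating_pool:
        tasks.put(parent)
    offspring = []
    lock = threading.Lock()          # protects the shared new population

    def proxy(server):
        while True:
            try:
                parent = tasks.get_nowait()
            except queue.Empty:
                return               # no work left for this server
            child = server(parent)   # stands in for the remote call
            with lock:
                offspring.append(child)

    threads = [threading.Thread(target=proxy, args=(s,)) for s in servers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return offspring


# Hypothetical server task: "mutate" the parent and evaluate the offspring.
def toy_server(parent):
    child = parent + 1
    return child, child * child     # (offspring, objective value)


new_population = run_generation([1, 2, 3], [toy_server, toy_server])
```

The order in which the offspring arrive depends on thread scheduling, so the new population should be treated as an unordered collection.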



18.2.2 Island Model 

Under some circumstances, the client-server approach may not be optimal, especially if 

1. Processing of the tasks is fast relative to the amount of time needed for the data
exchange between the client and the server. In other words, if the messages that have to be
exchanged travel longer than the work would take if performed locally, the client-server
method would actually slow down the system.

2. Populations are required that cannot be held completely in the memory of a single 
computer. This can be the case either if the solution candidates are complex and memory 
consuming or the nature of the problem requires large populations. 

In such cases, we can again learn from nature. Until now, we have only imitated evolution
on one large continent. All individuals in the population compete with each other and
there are no barriers between the solution candidates. In reality, there are obstacles like
mountain ranges or oceans, separating parts of the population and creating isolated
sub-populations. Another example for such a scenario is an archipelago like the Galapagos
islands where Darwin [485], the father of the evolution theory, performed his studies. On the
single islands, different species can evolve independently. From time to time, a few individuals
from one isle migrate to another one, maybe by traveling on a tree trunk over the water or
by being blown there by a storm. If they are fit enough in their new environment, they can
compete with the local population and survive. Otherwise, they will be displaced by the
native residents of the habitat. This way, the islands maintain an approximately equal level
of fitness of their individuals, while still preserving a large amount of diversity.
Figure 18.5: An evolutionary algorithm distributed in a P2P network.



We can easily copy this natural role model in evolutionary algorithms by using multiple 
sub-populations (demes) as discussed by Cohoon et al. [426], Martin et al. [1365], Skolicki and 
De Jong [1896, 1895], Gorges-Schleuter [836], Tanese [2007], and Toshine et al. [2048] and also
realized in various software packages such as those created by Whitley and Starkweather 
[2213], Paechter et al. [1597], Tan et al. [2003], Arenas et al. [83], Chong and Langdon 
[398], Luke et al. [1327], Cahon et al. [323], and Weise and Geihs [2177]. Distributing the 
demes on n different nodes in a network of computers, each representing one island, is 
maybe the most popular form of coarse grained parallelization. Hence, both disadvantages 
of the original master/slave approach are circumvented: Communication between nodes is 
only needed when individuals migrate between them. This communication can be performed 
asynchronously to the n independently running evolutionary algorithms and does not slow 
down their performance. The migration rule can furthermore be chosen in a way that reduces 
the network traffic. By dividing the population, the number of solution candidates to be held 
on single machines also decreases, which helps to mitigate the memory consumption problem. 

The island model can be realized by peer-to-peer networks 10 where each node runs an
independent evolution, as illustrated in Figure 18.5. Here, we have modified the selection
phase, which now returns some additional individuals to be transmitted to another node in
the system. Depending on the optimization problem, solution candidates migrating over the
network can either enter the fitness assignment process on the receiving machine directly
or may take part in the evaluation process first. If the latter is the case, different objective
functions can be applied on different nodes.
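The migration scheme can be illustrated with a small sequential simulation in Python. Everything here is an assumption made for the sketch: the islands are plain lists processed one after the other instead of nodes in a network, the per-island algorithm is a trivial mutate-and-keep-the-better rule, migrants travel along a ring, and the fitness (number of ones in a bit string) is maximized.

```python
import random


def evolve(population, fitness, rng):
    """One generation of a minimal EA: mutate each individual, keep the better."""
    result = []
    for ind in population:
        i = rng.randrange(len(ind))
        child = ind[:i] + (1 - ind[i],) + ind[i + 1:]
        result.append(max(ind, child, key=fitness))
    return result


def island_model(islands, fitness, generations, interval, rng):
    """Evolve all islands; every `interval` generations, migrate along a ring."""
    for gen in range(1, generations + 1):
        islands = [evolve(pop, fitness, rng) for pop in islands]
        if gen % interval == 0:
            emigrants = [max(pop, key=fitness) for pop in islands]
            for k, pop in enumerate(islands):
                immigrant = emigrants[(k - 1) % len(islands)]
                worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
                # The immigrant only survives if it beats the weakest resident.
                if fitness(immigrant) > fitness(pop[worst]):
                    pop[worst] = immigrant
    return islands


rng = random.Random(1)
start = [[(0,) * 8 for _ in range(4)] for _ in range(3)]   # 3 islands, 4 each
final = island_model(start, sum, generations=20, interval=5, rng=rng)
```

Communication only happens in the migration step, which is why the interval and the number of migrants directly control the network traffic in a distributed setting.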



10 P2P networks are discussed in Section 30.2.2 on page 557.
Driving this thought further, one will recognize that the peer-to-peer approach inherently
allows mixing of different optimization technologies, as outlined by Weise and Geihs
[2177]. On one node, for instance, the SPEA2-algorithm (see ?? on page ??) can be
performed, whereas another node could optimize according to plain hill climbing as described
in Chapter 10 on page 253. Such a system, illustrated in Figure 18.6, has a striking
advantage. From the No Free Lunch Theorem discussed in Section 1.4.10 on page 76, we know
that optimization algorithms perform differently for different problems. If the problem
is unimodal, i. e., has exactly one global optimum and no local optima, a hill climbing
approach will usually outperform other techniques since it directly converges to this optimum.
If the fitness landscape is rugged, on the other hand, methods like SPEA2 which have a
very balanced exploration/exploitation proportion are able to yield better results and hill
climbing algorithms will get stuck at local optima. In most cases, it is not possible to know
beforehand which optimization strategy will perform best. Furthermore, the best approach
may even change while the optimization proceeds. If a new, better individual evolves, i. e., a
new optimum is approached, hill climbing will be fast in developing this solution candidate
further until its best form is found, i. e., the bottom of the local optimum is reached. In other
phases, an exploration of the solution space may be required since all known local optima
have been tracked. A technology like Ant Colony Optimization could now come into play.
A heterogeneous mixture of these algorithms that exchanges individuals from time to time
will retain the good properties of the single algorithms and, in many cases, outperform a
homogeneous search [1275, 1023, 24, 95, 2283]. Just remember our discussion of Memetic
Algorithms in Chapter 15 on page 277.



The island model can also be applied locally by simply using disjoint local populations. 
Although this would not bring a performance gain, it could improve the convergence behavior 
of the optimization algorithm. Spieth et al. [1943], for instance, argue that the island model 
can be used to preserve the solution diversity. By doing so it decreases the probability of 
premature convergence (see Section 1.4.2 on page 58). 

Broadcast-Distributed Parallel Evolutionary Algorithm 

The Broadcast-Distributed Parallel Evolutionary Algorithm (BDP) defined by Johnson et al. 
[1060] extends the island model for large networks of wirelessly connected, resource-restricted 
devices such as sensor networks, amorphous and paintable computing systems. In the BDP, 
each node carries a separate population from which one individual is selected after each 
generation. This individual is broadcast to the neighbors of the node. Every time a node
receives an individual, it appends it to an internal mate list. Whenever the length of this
list exceeds a certain threshold, selection and subsequent crossover are performed on the joint
set of the population and the individuals in the mate list.

Figure 18.6: An example for a heterogeneous search.
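The mate-list mechanism of the BDP can be illustrated with a small sketch. It is not the implementation of Johnson et al. [1060]; the class name `BDPNode`, the choice to broadcast the best individual, the use of truncation selection, and one-point crossover on tuples are all assumptions made only for this example.

```python
import random

class BDPNode:
    """A single node holding its own population and a mate list (hypothetical sketch)."""

    def __init__(self, population, threshold, fitness, rng):
        self.population = list(population)  # genotypes as equal-length tuples
        self.threshold = threshold          # mate list length that triggers mating
        self.fitness = fitness              # smaller is better (assumption)
        self.rng = rng
        self.mate_list = []

    def emit(self):
        """Individual to broadcast after a generation (assumed: the current best)."""
        return min(self.population, key=self.fitness)

    def receive(self, individual):
        """Store a received individual; mate once the list exceeds the threshold."""
        self.mate_list.append(individual)
        if len(self.mate_list) > self.threshold:
            self._reproduce()

    def _reproduce(self):
        # Selection and crossover on the joint set of population and mate list.
        pool = self.population + self.mate_list
        parents = sorted(pool, key=self.fitness)[:max(2, len(self.population))]
        children = []
        for _ in range(len(self.population)):
            a, b = self.rng.sample(parents, 2)
            cut = self.rng.randrange(1, len(a))  # one-point crossover
            children.append(a[:cut] + b[cut:])
        self.population = children
        self.mate_list = []
```

In a simulation, each node would call `emit()` after every generation and pass the result to the `receive()` method of its neighbors.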

18.2.3 Mixed Distribution 

Of course, we can combine both distribution approaches previously discussed by having
a peer-to-peer network that also contains client-server systems, as sketched in Figure 18.7. 
Such a system will be especially powerful if we need large populations of individuals that 
take long to evaluate. Then, the single nodes in the peer-to-peer network together provide 
a larger virtual population, while speeding up their local evolutions by distributing the 
computational load to multiple servers. 

Figure 18.7: A mixed distributed evolutionary algorithm.



18.3 Cellular Genetic Algorithms 

Cellular Genetic Algorithms [33] are a special family of parallelization models for genetic 
algorithms which has been studied by various researchers such as Whitley [2212, 2208], 
Manderick and Spiessens [1353], Hillis [928], and Davidor [490]. A good understanding of
this model can be reached by starting with the basic architecture of the cellular system as 
described by Whitley [2208]. 

Assume we have a matrix m of N × N cells. Each cell has a processor and holds one
individual of the population. It can communicate with its right, left, top, and bottom
neighbor. Cells at the edge of the matrix are wired with cells in the same column/row at the
opposite edge. The cell $m_{i,j}$ can thus communicate with $m_{(i+1) \bmod N,\,j}$,
$m_{(i-1) \bmod N,\,j}$, $m_{i,\,(j+1) \bmod N}$, and $m_{i,\,(j-1) \bmod N}$. It is also
possible to extend this neighborhood to all cells within a given Manhattan distance, but let
us stick with the easiest case.
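The wrap-around wiring can be written down directly. The helper below is a minimal sketch with a hypothetical name (`neighbors`); it returns the coordinates of the four cells that cell (i, j) of an N × N matrix can communicate with.

```python
def neighbors(i, j, n):
    """Four neighbors of cell (i, j) on an n x n grid whose edges wrap around (a torus)."""
    return [
        ((i + 1) % n, j),  # next row, wraps from the last row to the first
        ((i - 1) % n, j),  # previous row, wraps from the first row to the last
        (i, (j + 1) % n),  # next column
        (i, (j - 1) % n),  # previous column
    ]
```

Python's `%` operator yields a non-negative result for a positive modulus, so `(i - 1) % n` already handles the wrap-around at index 0.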

Each cell can evaluate the individual it locally holds. For creating offspring, it can either
mutate this individual or recombine it with one selected from the four solution candidates
on its neighbors. At the beginning, the matrix is initialized with random individuals. After
some time, the spatial restriction of mating leads to the occurrence of local neighborhoods
with similar solution candidates denoting local optima. These neighborhoods begin to grow
until they touch each other. Then, the regions of better optima will "consume" worse ones and
reduce the overall diversity.

Although there are no fixed mating restrictions like in the island model, regions that 
are about twenty or so moves away will virtually not influence each other. We can consider 
groups of cells that distant as separate sub-populations. This form of separation is called 
isolation by distance - again, a term that originally stems from biology (coined by Wright
[2261]) [2208, 432, 1478, 836]. For observing such effects, it is said that a certain minimum
number of cells is required - at least about 1000 according to Whitley [2208].

Maintaining the Optimal Set



Most multi-objective optimization algorithms return a set of optimal solutions X* instead of 
a single individual x*. Many optimization techniques also internally keep track of the set of 
best solution candidates encountered during the search process. In Simulated Annealing, for 
instance, it is quite possible to discover an optimal element x* and subsequently depart from 
it to a local optimum x* . Therefore, optimizers normally carry a list of the non-prevailed 
solution candidates ever visited with them. 

In scenarios where the search space G differs from the problem space X it often makes 
more sense to store the list of optimal individual records P* instead of just keeping the 
optimal phenotypes x* . Since the elements of the search space are no longer required at the 
end of the optimization process, we define the simple operation extractPhenotypes which 
extracts them from a set of individuals P. 

$\forall x \in \text{extractPhenotypes}(P) \Rightarrow \exists p \in P : x = p.x$ (19.1)

19.1 Updating the Optimal Set 

Whenever a new individual p is created, the set of optimal individuals P* may change. It is 
possible that the new solution candidate must be included in the optimal set or even prevails 
some of the phenotypes already contained therein which then must be removed. 

Definition 19.1 (updateOptimalSet). The function updateOptimalSet updates a set of
optimal elements $P^\star_{old}$ with the new solution candidate $p_{new}.x$. It uses implicit
knowledge of the prevalence relation $\succ$ and the corresponding comparator function $cmp_F$.

$P^\star_{new} = \text{updateOptimalSet}(P^\star_{old}, p_{new})$,

$\forall p_1 \in P^\star_{old}\ \nexists p_2 \in P^\star_{old} : p_2.x \succ p_1.x \;\Rightarrow\; P^\star_{new} \subseteq P^\star_{old} \cup \{p_{new}\} \;\wedge$ (19.2)

$\forall p_1 \in P^\star_{new}\ \nexists p_2 \in P^\star_{new} : p_2.x \succ p_1.x$

We define two equivalent approaches in Algorithm 19.1 and Algorithm 19.2 which perform 
the necessary operations. Algorithm 19.1 creates a new, empty optimal set and successively 
inserts optimal elements whereas Algorithm 19.2 removes all elements which are prevailed 
by the new individual p new from the old optimal set P* ld . 

Especially in the case of evolutionary algorithms, not a single new element is created
in each generation but a set P. Let us define the operation updateOptimalSetN for this
purpose. This operation can easily be realized by iteratively applying updateOptimalSet, as
shown in Algorithm 19.3.

Algorithm 19.1: P*new ← updateOptimalSet(P*old, pnew)

Input: P*old: the optimal set as known before the creation of pnew
Input: pnew: a new individual to be checked
Output: P*new: the optimal set updated with the knowledge of pnew

begin
  P*new ← ∅
  foreach pold ∈ P*old do
    if ¬(pnew.x ≻ pold.x) then
      P*new ← P*new ∪ {pold}
      if pold.x ≻ pnew.x then return P*old
  return P*new ∪ {pnew}
end



Algorithm 19.2: P*new ← updateOptimalSet(P*old, pnew) (2nd Version)

Input: P*old: the optimal set as known before the creation of pnew
Input: pnew: a new individual to be checked
Output: P*new: the optimal set updated with the knowledge of pnew

begin
  P*new ← P*old
  foreach pold ∈ P*old do
    if pnew.x ≻ pold.x then
      P*new ← P*new \ {pold}
    else if pold.x ≻ pnew.x then
      return P*old
  return P*new ∪ {pnew}
end



Algorithm 19.3: P*new ← updateOptimalSetN(P*old, P)

Input: P*old: the old optimal set
Input: P: the set of new individuals to be checked for optimality
Data: p: an individual from P
Output: P*new: the updated optimal set

begin
  P*new ← P*old
  foreach p ∈ P do P*new ← updateOptimalSet(P*new, p)
  return P*new
end
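For readers who prefer executable code, Algorithms 19.1 and 19.3 can be sketched as follows. The sketch makes several assumptions that are not part of the definitions above: individuals are represented only by their tuples of objective values, all objectives are minimized, Pareto dominance serves as the prevalence relation, and the archive is kept as a list without duplicates. The function names are hypothetical.

```python
def dominates(a, b):
    """Pareto dominance for minimization: a is nowhere worse and somewhere better than b."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def update_optimal_set(archive, p_new, prevails=dominates):
    """Return the archive updated with p_new (in the spirit of Algorithm 19.1)."""
    if p_new in archive:              # a set would silently ignore the duplicate
        return list(archive)
    kept = []
    for p_old in archive:
        if prevails(p_old, p_new):    # p_new is prevailed: nothing changes
            return list(archive)
        if not prevails(p_new, p_old):
            kept.append(p_old)        # p_old survives
    return kept + [p_new]

def update_optimal_set_n(archive, candidates, prevails=dominates):
    """Apply update_optimal_set for every candidate (in the spirit of Algorithm 19.3)."""
    for p in candidates:
        archive = update_optimal_set(archive, p, prevails)
    return archive
```

Passing a different `prevails` function is enough to switch from Pareto dominance to another prevalence relation.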



19.2 Obtaining Optimal Elements 

The function updateOptimalSet helps an optimizer to build and maintain a list of optimal
individuals. When the optimization process finishes, the function extractPhenotypes can then
be used to obtain the optimal elements of the problem space and return them to the user.
However, not all optimization methods maintain an optimal set all the time. When they
terminate, they have to extract all optimal elements from the set of individuals Pop
currently known.

Definition 19.2 (extractOptimalSet). The function extractOptimalSet extracts a
set P* of optimal (non-prevailed) individuals from any given set of individuals Pop.

$\forall P^\star \subseteq Pop \subseteq \mathbb{G} \times \mathbb{X},\ P^\star = \text{extractOptimalSet}(Pop) \Rightarrow \forall p_1 \in P^\star\ \nexists p_2 \in Pop : p_2.x \succ p_1.x$ (19.3)

Algorithm 19.4 defines one possible realization of extractOptimalSet. By the way, this
approach could also be used for updating an optimal set.

$\text{updateOptimalSet}(P^\star_{old}, p_{new}) = \text{extractOptimalSet}(P^\star_{old} \cup \{p_{new}\})$ (19.4)



Algorithm 19.4: P* ← extractOptimalSet(Pop)

Input: Pop: the list to extract the optimal individuals from
Data: pany, pchk: solution candidates tested for supremacy
Data: i, j: counter variables
Output: P*: the optimal subset extracted from Pop

begin
  P* ← Pop
  for i ← len(P*) − 1 down to 1 do
    for j ← i − 1 down to 0 do
      if P*[i] ≻ P*[j] then
        P* ← deleteListItem(P*, j)
        i ← i − 1
      else if P*[j] ≻ P*[i] then
        P* ← deleteListItem(P*, i)
  return listToSet(P*)
end
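The specification in Equation 19.3 can also be implemented without in-place index manipulation. The following sketch is not a transcription of Algorithm 19.4; it simply keeps every individual that no other individual prevails. As assumptions, individuals are tuples of objective values to be minimized and Pareto dominance is used as the prevalence relation.

```python
def dominates(a, b):
    """Pareto dominance for minimization (assumed prevalence relation)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def extract_optimal_set(pop, prevails=dominates):
    """Keep the individuals of pop that are not prevailed by any member of pop.
    Uses O(len(pop)^2) comparisons and preserves the original order."""
    return [p for p in pop if not any(prevails(q, p) for q in pop)]
```

Following Equation 19.4, calling `extract_optimal_set(archive + [p_new])` is one way to update an existing archive with a new individual.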



19.3 Pruning the Optimal Set 

In some optimization problems, there may be very many, if not infinitely many, optimal
individuals. The set X* of optimal solution candidates computed by the optimization
algorithms, however, cannot grow infinitely because we only have limited memory. Therefore,
we need to perform an action called pruning which reduces the size of the optimal set to a
given limit k [1466, 1993, 427].

There exists a variety of possible pruning operations. Morse [1466] and Taboada and
Coit [1992], for instance, suggest using clustering algorithms 1 for this purpose. In principle,
any combination of the fitness assignment and selection schemes discussed in Chapter 2
would also do. It is very important that the loss of generality during a pruning operation is
minimized. Fieldsend et al. [667], for instance, point out that if the extreme elements of the
optimal frontier are lost, the resulting set may only represent a small fraction of the optima
that could have been found without pruning. Instead of working on the set of optimal
solution candidates X*, we again base our approach on a set of optimal individuals P* and
we define:

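Definition 19.3 (pruneOptimalSet). The pruning operation pruneOptimalSet reduces the
size of a set $P^\star_{old}$ of individuals to fit to a given upper boundary k.

$\forall P^\star_{new} \subseteq P^\star_{old} \subseteq \mathbb{G} \times \mathbb{X},\ k \in \mathbb{N} : P^\star_{new} = \text{pruneOptimalSet}(P^\star_{old}, k) \Rightarrow |P^\star_{new}| \leq k$ (19.5)
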
19.3.1 Pruning via Clustering 

Algorithm 19.5 uses clustering to provide the functionality specified in this definition and 
thereby realizes the idea of Morse [1466] and Taboada and Coit [1992]. Basically, any given 
clustering algorithm could be used as replacement for cluster - see Chapter 29 on page 535 
for more information on clustering. 

1 You can find a discussion of clustering algorithms in Section 29.2. 

Algorithm 19.5: P*new ← pruneOptimalSet_c(P*old, k)

Input: P*old: the optimal set to be pruned
Input: k: the maximum size allowed for the optimal set (k > 0)
Input: [implicit] cluster: the clustering algorithm to be used
Input: [implicit] nucleus: the function used to determine the nuclei of the clusters
Data: B: the set of clusters obtained by the clustering algorithm
Data: b: a single cluster b ∈ B
Output: P*new: the pruned optimal set

begin
  // obtain k clusters
  B ← cluster(P*old)
  P*new ← ∅
  foreach b ∈ B do P*new ← P*new ∪ nucleus(b)
  return P*new
end
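A runnable illustration of Algorithm 19.5 needs concrete stand-ins for cluster and nucleus. The sketch below uses a deliberately crude clustering (sorting by the first objective and cutting the sorted list into k contiguous groups) and takes the member closest to the centroid as nucleus. Both choices, as well as the representation of individuals as tuples of objective values, are assumptions made for this example; any of the clustering algorithms from Chapter 29 could be substituted.

```python
def cluster(points, k):
    """Toy clustering (an assumption): sort the points and cut them into
    at most k contiguous groups of nearly equal size."""
    ordered = sorted(points)
    k = min(k, len(ordered))
    size, rest = divmod(len(ordered), k)
    clusters, start = [], 0
    for c in range(k):
        end = start + size + (1 if c < rest else 0)
        clusters.append(ordered[start:end])
        start = end
    return clusters

def nucleus(members):
    """Member of the cluster with the smallest squared distance to the centroid."""
    dims = range(len(members[0]))
    centroid = [sum(p[d] for p in members) / len(members) for d in dims]
    return min(members, key=lambda p: sum((p[d] - centroid[d]) ** 2 for d in dims))

def prune_optimal_set_c(archive, k):
    """Reduce the archive to at most k individuals, one nucleus per cluster."""
    if len(archive) <= k:
        return list(archive)
    return [nucleus(b) for b in cluster(archive, k)]
```

Note that nothing in this scheme protects the extreme elements of the frontier, which is exactly the weakness pointed out by Fieldsend et al. [667].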



19.3.2 Adaptive Grid Archiving 

Let us discuss the adaptive grid archiving algorithm (AGA) as an example of a more
sophisticated approach to pruning the optimal set. AGA has been introduced for the
evolutionary algorithm PAES 2 by Knowles and Corne [1154] and uses the objective values
(computed by the set of objective functions F) directly. Hence, it can treat the individuals as
|F|-dimensional vectors where each dimension corresponds to one objective function f ∈ F.
This |F|-dimensional objective space Y is divided into a grid with d divisions in each
dimension. Its span in each dimension is defined by the corresponding minimum and maximum
objective values. The individuals with the minimum/maximum values are always preserved.
This circumvents the phenomenon of narrowing down the optimal set described by Fieldsend
et al. [667] and distinguishes the AGA approach from clustering-based methods. Hence, it
is not possible to define maximum optimal set sizes k which are smaller than 2|F|. If
individuals need to be removed from the set because it became too large, the AGA approach
removes those that reside in the most crowded regions.

The original sources outline the algorithm basically with descriptions and definitions.
Here, we introduce a more or less trivial specification in Algorithm 19.6 and
Algorithm 19.7 on page 312. The function agaDivide is internally used to perform
the grid division. It transforms the objective values of each individual to grid coordinates
stored in the array lst. Furthermore, agaDivide also counts the number of individuals that
reside in the same coordinates for each individual and makes it available in cnt. It
ensures the preservation of border individuals by assigning a negative cnt value to them. This
basically disables their later disposal by the pruning algorithm pruneOptimalSet_aga since
pruneOptimalSet_aga deletes the individuals from the set P*old that have the largest cnt
values first.



2 PAES is discussed in ?? on page ?? 

Algorithm 19.6: (Pl, lst, cnt) ← agaDivide(P*old, d)

Input: P*old: the optimal set to be pruned
Input: d: the number of divisions to be performed per dimension
Input: [implicit] F: the set of objective functions
Data: i, j: counter variables
Data: mini, maxi, mul: temporary stores
Output: (Pl, lst, cnt): a tuple containing the list representation Pl of P*old, a list lst
        assigning grid coordinates to the elements of Pl and a list cnt containing the
        number of elements in those grid locations

begin
  mini ← createList(|F|, ∞)
  maxi ← createList(|F|, −∞)
  for i ← |F| down to 1 do
    mini[i−1] ← min{f_i(p.x) ∀p ∈ P*old}
    maxi[i−1] ← max{f_i(p.x) ∀p ∈ P*old}
  mul ← createList(|F|, 0)
  for i ← |F| − 1 down to 0 do
    if maxi[i] ≠ mini[i] then
      mul[i] ← d / (maxi[i] − mini[i])
    else
      maxi[i] ← maxi[i] + 1
      mini[i] ← mini[i] − 1
  Pl ← setToList(P*old)
  lst ← createList(len(Pl), ∅)
  cnt ← createList(len(Pl), 0)
  for i ← len(Pl) − 1 down to 0 do
    lst[i] ← createList(|F|, 0)
    for j ← 1 up to |F| do
      if (f_j(Pl[i]) ≤ mini[j−1]) ∨ (f_j(Pl[i]) ≥ maxi[j−1]) then
        cnt[i] ← cnt[i] − 2
      lst[i][j−1] ← ⌊(f_j(Pl[i]) − mini[j−1]) * mul[j−1]⌋
    if cnt[i] ≥ 0 then
      for j ← i + 1 up to len(Pl) − 1 do
        if lst[i] = lst[j] then
          cnt[i] ← cnt[i] + 1
          if cnt[j] ≥ 0 then cnt[j] ← cnt[j] + 1
  return (Pl, lst, cnt)
end

Algorithm 19.7: P*new ← pruneOptimalSet_aga(P*old, d, k)

Input: P*old: the optimal set to be pruned
Input: d: the number of divisions to be performed per dimension
Input: k: the maximum size allowed for the optimal set (k ≥ 2|F|)
Input: [implicit] F: the set of objective functions
Data: i: a counter variable
Data: Pl: the list representation of P*old
Data: lst: a list assigning grid coordinates to the elements of Pl
Data: cnt: the number of elements in the grid locations defined in lst
Output: P*new: the pruned optimal set

begin
  if len(P*old) ≤ k then return P*old
  (Pl, lst, cnt) ← agaDivide(P*old, d)
  while len(Pl) > k do
    idx ← 0
    for i ← len(Pl) − 1 down to 1 do
      if cnt[i] > cnt[idx] then idx ← i
    for i ← len(Pl) − 1 down to 0 do
      if (lst[i] = lst[idx]) ∧ (cnt[i] > 0) then cnt[i] ← cnt[i] − 1
    Pl ← deleteListItem(Pl, idx)
    cnt ← deleteListItem(cnt, idx)
    lst ← deleteListItem(lst, idx)
  return listToSet(Pl)
end
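The idea behind Algorithms 19.6 and 19.7 can be condensed into a short sketch. It is a simplification and not a line-by-line transcription: individuals are assumed to be tuples of objective values, grid coordinates are clamped to the range 0..d−1, only non-border individuals are counted per grid cell, and the crowding counts are recomputed in each iteration instead of being decremented. The name `aga_prune` is hypothetical.

```python
from collections import Counter

def aga_prune(archive, d, k):
    """Shrink the archive to at most k entries by repeatedly deleting a
    non-border individual from the most crowded grid cell."""
    pts = list(archive)
    if len(pts) <= k:
        return pts
    dims = range(len(pts[0]))
    lo = [min(p[i] for p in pts) for i in dims]  # grid span, fixed once
    hi = [max(p[i] for p in pts) for i in dims]

    def cell(p):
        # Grid coordinates of p; a degenerate dimension maps to cell 0.
        return tuple(
            min(d - 1, int((p[i] - lo[i]) * d / (hi[i] - lo[i]))) if hi[i] > lo[i] else 0
            for i in dims)

    def is_border(p):
        # Individuals holding a minimum or maximum objective value are preserved.
        return any(hi[i] > lo[i] and (p[i] == lo[i] or p[i] == hi[i]) for i in dims)

    while len(pts) > k:
        removable = [p for p in pts if not is_border(p)]
        if not removable:  # only border individuals are left, k was too small
            break
        crowd = Counter(cell(p) for p in removable)
        victim = max(removable, key=lambda p: crowd[cell(p)])
        pts.remove(victim)
    return pts
```

If more border individuals exist than k permits, the sketch returns a set larger than k rather than discarding an extreme element, which mirrors the restriction k ≥ 2|F| stated above.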



Part II 
Experimental Settings, Measures, and Evaluations



In this chapter we will discuss the possible experimental settings, the things that we can 
measure during experiments, and what information we can extract from these measurements. 
We will also define some suitable shortcuts which we use later in descriptions of experiments 
in order to save space in tables and graphics. 

20.1 Settings 

Experiments can only be reproduced and understood if their setup is precisely described. 
Especially in the area of global optimization algorithms, there is a wide variety of possible 
parameter settings. Incompletely documenting an experiment may lead to misunderstand- 
ings. Other researchers repeating the tests will use default settings for all parameters not 
specified (or any settings that they find neat) and possibly obtain totally different results. 
Here, we will try to list many possible parameters of experiments (without claiming com- 
pleteness). Of course, not all of them are relevant in a given experiment. Instead, this list is 
to be understood as a hint of what to consider when specifying a test series. In many tables 
and graphics, we will use shortcuts of the parameter names in order to save space. 

20.1.1 The Optimization Problem 

As stated in Definition 1.34 on page 46, an optimization problem is a five-tuple 
(X, F, G, Op, gpm) specifying the problem space X, the objective functions F, the search 
space G, the set of search operations Op, and the genotype-phenotype mapping gpm. Spec- 
ifying the elements of this tuple is the most important prerequisite for any experiment. 
Table 20.1 gives an example for this structure. 

Parameter     Short  Description

Problem       X      The space of possible solution candidates (see Section 1.3.1).
Space                Example: The variable-length natural vectors
                     $x = (x_0, x_1, \dots)^T$, $x_i \in \mathbb{N}\ \forall i \in [0, \text{len}(x) - 1]$,
                     $0 < \text{len}(x) \leq 500\ \forall x \in \mathbb{X}$

Objective     F      The objective functions which measure the utility of the solution
Functions            candidates. If nothing else is stated, minimization is assumed
                     (see Definition 1.1).
                     Example: $F = \{f_1, f_2\}$ : $f_1(x) = \sum_{i=0}^{\text{len}(x)-1} x_i$, ...

Search Space  G      The space of the elements where the search operations are applied
                     on (see Section 1.3.1).
                     Example: The variable-length bit strings $g : 0 < \text{len}(g) \leq 2000$.

Search        Op     The search operations available for the optimizer. Here it is also
Operations           important to note the way in which they are applied (see Section 1.3.1).
                     Example: creation: uniform in length and values
                              os = 1 → mr = 10% mutation, cr = 90% multi-point crossover
                              os = 2 → mr = 25% mutation, cr = 80% multi-point crossover
                              os = 3 → mr = 45% mutation, cr = 65% multi-point crossover
                              The results from crossover may be mutated
                              (non-exclusive search operations).

GPM           gpm    The genotype-phenotype mapping translates points from the search
                     space G to points in the problem space X (see Definition 1.30).
                     Example: $x = \text{gpm}(g) \Rightarrow x_i = \sum_{j=4i}^{4i+3} g_j$,
                              $\text{len}(x) = \lfloor \tfrac{1}{4}\text{len}(g) \rfloor$

Table 20.1: The basic optimization problem settings.

In this table we have defined a simple example optimization problem which has a problem
space composed of vectors with between 1 and 500 elements (natural numbers). These vectors
are encoded as variable length bit strings, where groups of four bits stand for one vector
element. As objective functions, two simple sums over the vector elements are applied.
When needed, new strings with uniformly distributed length and contents are created with
the creation operation. Sometimes, a test series involves tests with different settings. This
is the case in this example too, where three configurations for the reproduction operations
are given. In a table containing the results of the experiments, there could be a column
"os" containing the values from 1 to 3 corresponding to these settings. In our example,
elements resulting from the crossover may be mutated (which is meant by "non-exclusive").
Therefore, the percentages in which the operators are applied do not necessarily need to
sum up to 100%. With this definition, the problem can be reproduced easily and it is also
possible to apply different global optimization algorithms and to compare the results.
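To make the example encoding tangible, here is a small sketch of such a genotype-phenotype mapping. It assumes that each vector element is simply the sum of its four bits and that surplus bits at the end of the string are ignored; a binary-weighted decoding of each group would be an equally plausible reading. The function names `gpm` and `f1` are chosen for this illustration only.

```python
def gpm(g):
    """Map a bit string (list of 0/1) to a vector of natural numbers,
    one element per complete group of four bits."""
    n = len(g) // 4  # incomplete trailing groups are dropped
    return [sum(g[4 * i: 4 * i + 4]) for i in range(n)]

def f1(x):
    """A simple sum over the vector elements, to be minimized."""
    return sum(x)

genotype = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # ten bits: two groups plus two spare bits
phenotype = gpm(genotype)
```

A bit string of the maximum length 2000 yields a vector of exactly 500 elements under this mapping, which matches the upper bound of the problem space.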

20.1.2 The Optimization Algorithm Applied 

The performance of an optimization algorithm strongly depends on its configuration.
Table 20.2 lists some of the parameters most commonly involved in this book and gives examples
of how they could be configured.



Parameter     Short  Description

Optimization  alg    The optimization algorithm used to solve the problem
Algorithm            (see Definition 1.39).
                     Example: alg = 0 → (Parallel) Random Walks
                              alg = 1 → evolutionary algorithm
                              alg = 2 → Memetic Algorithm
                              alg = 3 → Simulated Annealing
                              alg = 4 → downhill simplex

Comparison    cm     In multi-objective optimization, individuals are often compared
Operator             according to Pareto or prevalence schemes. This parameter states
                     which scheme was used, if any (see Section 1.2.4).
                     Example: cm = 0 → weighted sum
                              cm = 1 → Pareto comparison

Steady-State  ss     Are the parent individuals in the population simply discarded
                     (generational) or do they compete with their offspring (steady-state)?
                     This parameter is usually only valid in the context of evolutionary
                     algorithms or other population-oriented optimizers (see Section 2.1.6).
                     Example: ss = 0 → generational
                              ss = 1 → steady-state

Population    ps     The population size (only valid for population-oriented algorithms).
Size                 Example: ps ∈ {10, 100, 1000, 10 000}

Elitism       el     Are the best solution candidates preserved (elitism) or not
                     (no elitism)? (see Definition 2.4)
                     Example: el = 0 → no elitism is used
                              el = 1 → elitism is used

Maximum       as     The size of the archive with the best known individuals (only valid
Archive Size         if elitism is used). Notice: An archive size of zero corresponds to no
                     elitism (see Definition 2.4).
                     Example: as ∈ {0, 10, 20, 40, 80}

Fitness       fa     The fitness assignment algorithm used (see Section 2.3).
Assignment           Example: fa = 0 → weighted sum fitness assignment
Algorithm                     fa = 1 → Pareto ranking
                              fa = 2 → Variety Preserving Ranking

Selection     sel    The selection algorithm used (see Section 2.4).
Algorithm            Example: sel = 0 → Fitness proportionate selection
                              sel = 1 → Tournament selection
                              sel = 2 → Truncation Selection

Tournament    k      The number of individuals competing in tournaments in tournament
Size                 selection (only valid for sel = 1) (see Section 2.4.8).
                     Example: k ∈ {2, 3, 4, 5}

Convergence   cp     Is the simple convergence prevention method used? (see Section 2.4.8)
Prevention           Example: cp ∈ {0, 0.1, 0.2, 0.3, 0.4}

Table 20.2: Some basic optimization algorithm settings.

Such a table describes a set of experiments if at least one of the parameters has more 
than one value. If several parameters can be configured differently, the number of experi- 
ments multiplies accordingly. Then, (full) factorial experiments 1 [263, 2288, 681] where all 
possible parameter combinations are tested separately (multiple times) can be performed. 
Factorial experiments are one basic design of experiments 2 (DoE) [682, 1149, 263, 460, 2288]. 
Since many parameter settings of evolutionary algorithms influence each other [648]
(for example elitism and steady state, selection and fitness assignment, mutation and
crossover rate, etc.), it is insufficient to test the influence of each parameter separately.
Instead, DoE designs are recommended in order to determine the effect of these factors with
efficient experiments.
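Enumerating a full factorial design is straightforward with the Python standard library. The sketch below is only an illustration; the function name `factorial_design` and the chosen parameter levels are assumptions, with the short names ps, el, and sel borrowed from Table 20.2.

```python
from itertools import product

def factorial_design(levels):
    """Return one configuration (a dict) for every combination of parameter values."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[name] for name in names))]

levels = {
    "ps": [10, 100, 1000],  # population sizes
    "el": [0, 1],           # without / with elitism
    "sel": [0, 1, 2],       # selection algorithms
}
configs = factorial_design(levels)  # 3 * 2 * 3 = 18 configurations
```

Since each configuration usually has to be run several times, the total number of runs is the number of configurations multiplied by the number of repetitions, which is why the experiment count grows so quickly with every additional parameter.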

20.1.3 Other Run Parameters 

Table 20.3 lists some additional parameters describing how the optimization algorithm was
applied and how the experiments were carried out.

1 http://en.wikipedia.org/wiki/Factorial_experiment [accessed 2008-08-07] 

2 http://en.wikipedia.org/wiki/Design_of_experiments [accessed 2008-10-14] 

Parameter      Short  Description

Number of      tc     The number of training cases used for evaluating the objective
Training              functions.
Cases                 Example: tc ∈ {1, 10, 20}

Training Case  ct     The policy according to which the training cases are changed
Change                (see Definition 1.39).
Policy                Example: ct = 0 → The training cases do not change.
                               ct = 1 → The training cases change each generation.
                               ct = 2 → Training cases change after each evaluation.

Evaluation     mxτ    The maximum number of individual evaluations τ that each run is
Limit                 allowed to perform (see Definition 1.42).
                      Example: mxτ = 45 000

Generation     mxt    The maximum number of iterations/generations t that each run is
Limit                 allowed to perform (see Definition 1.43).
                      Example: mxt = 1000

Maximum        mxr    The (maximum) number of runs to perform. This threshold can
Number of             be combined with time constraints which may lead to fewer runs
Runs                  being performed.
                      Example: mxr = 40

Maximum        mxrT   The (maximum) amount of time granted per run.
Time per Run          Example: mxrT = 40h

Maximum        mxT    The (maximum) total time granted.
Total Time            Example: mxT = 40d 5h

System         Cfg    Especially in cases where time constraints are imposed, the
Configuration         configuration of the system on which the experiments run becomes
                      important.
                      Example: Cfg = 0 → one 9 GHz two-core PC, 3 GiB RAM,
                                         Windows XP, Java 1.4
                               Cfg = 1 → one 9 GHz two-core PC, 3 GiB RAM,
                                         Windows XP, Java 1.6

Table 20.3: Some additional parameters of experiments.



20.2 Measures 

In Table 20.4 we list some basic measurements that can easily be taken from each test
series of a single configuration. From these basic results, more meaningful metrics can be
computed.



Measure       Short  Description

Number of     #r     The total number of completed runs with the specified configuration.
Comp. Runs           Example: #r = 100

Success       sτ_i   The number of individual evaluations τ performed in run i until the
Evaluations          first individual with optimal functional objective values occurred.
                     This individual may have non-optimal non-functional objective values
                     or be overfitted (see Definition 1.42).
                     Example: sτ_i = 100 → 1st successful individual in evaluation 100
                              sτ_i = ∅ → no successful individual was found

Success       st_i   The number of iterations/generations t performed in run i until the
Generations          first individual with optimal functional objective values occurred.
                     This individual may have non-optimal non-functional objective values
                     or be overfitted (see Definition 1.43).
                     Example: st_i = 800 → 1st successful individual in generation 800
                              st_i = ∅ → no successful individual was found

Perfection    pτ_i   The number of individual evaluations τ performed in run i until
Evaluation           a "perfect" individual with optimal functional objective values
                     occurred. Depending on the context, this may be an individual where
                     all objective values are optimal, a non-overfitted individual with
                     optimal functional objectives, or both. Thus, pτ_i ≥ sτ_i always holds
                     (see Definition 1.42).
                     Example: pτ_i = 100 → 1st perfect individual in evaluation 100
                              pτ_i = ∅ → no perfect individual was found

Perfection    pt_i   The number of iterations/generations t performed in run i until a
Generation           "perfect" individual with optimal functional objective values
                     occurred. Depending on the context, this may be an individual where
                     all objective values are optimal, a non-overfitted individual with
                     optimal functional objectives, or both. Thus, pt_i ≥ st_i always holds
                     (see Definition 1.43).
                     Example: pt_i = 800 → 1st perfect individual in generation 800
                              pt_i = ∅ → no perfect individual was found

Solution Set  X*_i   The set X*_i ⊆ X of solutions returned by run i.
                     Example: X*_i = {(0, 1, 2)^T, (4, 5, 6)^T}

Runtime       rT_i   The total time needed by run i.
                     Example: rT_i = 312s

Table 20.4: Some basic measures that can be obtained from experiments.

The distinction between successful and perfect individuals may seem confusing at first 
glance and is not necessary in many experiments. Often though, such a distinction is useful. 
Assume, for instance, that we want to optimize a schedule for a transportation or manufac- 
turing company. A successful schedule, in this case, would be one that allows the company 
to process all orders. This does not necessarily mean that it is optimal. Perfect would stand 
for optimal in this context while successful would translate to feasible. If we evolve programs 
with Genetic Programming and use training cases for their evaluation, we can consider a run 
as successful if it finds a program which works correctly on all training cases. This, however, 
could also be caused by overfitting. Then, a perfect program would be one that either also 
works correctly on another, larger set of test cases not used during the optimization process 
or whose correctness has been proven. An experimental run is called "successful" ("perfect")
if it found a successful ("perfect") individual. The concrete definition of successful and
perfect is problem specific and must be stated whenever using these predicates. Furthermore,
notice that we use the sign ∅ in order to denote runs where no successful (or perfect) solution
candidate was discovered.



20.3 Evaluation 

20.3.1 Simple Evaluation Measures 

After a series of experiments has been carried out and the measures from Section 20.2 
have been collected, we can use them to compute some first, simple metrics which can 
serve as basis for deriving more comprehensive statistics. The simple metrics basically cover
everything mentioned in Section 28.3: For a quantity q measured in multiple experimental
runs, we can compute the number #q of experiments that fulfilled the predicates attached
to q and the estimators of the minimum $\check{q}$, mean $\overline{q}$, maximum $\hat{q}$, median med(q), and the
standard deviation s[q], and so on. Obviously, not all of them are needed or carry a meaning
in every experiment. Table 20.5 lists some of these first metrics.

Measure Short Description



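Number of Successful Runs, $\#s$
The number of runs where successful individuals were discovered.
$\#s = |\{i : (s\tau_i \neq \emptyset) \wedge (0 \le i < \#r)\}|$ (20.1)

Success Fraction, $s/r$
The fraction of experimental runs that turned out successful.
$s/r = \frac{\#s}{\#r}$ (20.2)

Minimum Success Evaluation, $\check{s\tau}$
The number of evaluations $\tau$ needed by the fastest (successful) experimental run to find a successful individual (or $\emptyset$ if no run was successful).
$\check{s\tau} = \begin{cases} \min\{s\tau_i \neq \emptyset\} & \text{if } \exists\, s\tau_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.3)

Mean Success Evaluation, $\overline{s\tau}$
The average number of evaluations $\tau$ needed by the (successful) experimental runs to find a successful individual (or $\emptyset$ if no run was successful).
$\overline{s\tau} = \begin{cases} \frac{1}{\#s} \sum_{s\tau_i \neq \emptyset} s\tau_i & \text{if } \exists\, s\tau_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.4)

Maximum Success Evaluation, $\hat{s\tau}$
The number of evaluations $\tau$ needed by the slowest (successful) experimental run to find a successful individual (or $\emptyset$ if no run was successful).
$\hat{s\tau} = \begin{cases} \max\{s\tau_i \neq \emptyset\} & \text{if } \exists\, s\tau_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.5)

Minimum Success Generation, $\check{st}$
The number of generations/iterations t needed by the fastest (successful) experimental run to find a successful individual (or $\emptyset$ if no run was successful).
$\check{st} = \begin{cases} \min\{st_i \neq \emptyset\} & \text{if } \exists\, st_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.6)

Mean Success Generation, $\overline{st}$
The average number of generations/iterations t needed by the (successful) experimental runs to find a successful individual (or $\emptyset$ if no run was successful).
$\overline{st} = \begin{cases} \frac{1}{\#s} \sum_{st_i \neq \emptyset} st_i & \text{if } \exists\, st_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.7)

Maximum Success Generation, $\hat{st}$
The number of generations/iterations t needed by the slowest (successful) experimental run to find a successful individual (or $\emptyset$ if no run was successful).
$\hat{st} = \begin{cases} \max\{st_i \neq \emptyset\} & \text{if } \exists\, st_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.8)

Number of Perfect Runs, $\#p$
The number of runs where perfect individuals were discovered.
$\#p = |\{i : (p\tau_i \neq \emptyset) \wedge (0 \le i < \#r)\}|$ (20.9)

Perfection Fraction, $p/r$
The fraction of experimental runs that found perfect individuals.
$p/r = \frac{\#p}{\#r}$ (20.10)

Minimum Perfection Evaluation, $\check{p\tau}$
The number of evaluations $\tau$ needed by the fastest (perfect) experimental run to find a perfect individual (or $\emptyset$ if no run found one).
$\check{p\tau} = \begin{cases} \min\{p\tau_i \neq \emptyset\} & \text{if } \exists\, p\tau_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.11)

Mean Perfection Evaluation, $\overline{p\tau}$
The average number of evaluations $\tau$ needed by the (perfect) experimental runs to find a perfect individual (or $\emptyset$ if no run found one).
$\overline{p\tau} = \begin{cases} \frac{1}{\#p} \sum_{p\tau_i \neq \emptyset} p\tau_i & \text{if } \exists\, p\tau_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.12)

Maximum Perfection Evaluation, $\hat{p\tau}$
The number of evaluations $\tau$ needed by the slowest (perfect) experimental run to find a perfect individual (or $\emptyset$ if no run found one).
$\hat{p\tau} = \begin{cases} \max\{p\tau_i \neq \emptyset\} & \text{if } \exists\, p\tau_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.13)

Minimum Perfection Generation, $\check{pt}$
The number of generations/iterations t needed by the fastest (perfect) experimental run to find a perfect individual (or $\emptyset$ if no run found one).
$\check{pt} = \begin{cases} \min\{pt_i \neq \emptyset\} & \text{if } \exists\, pt_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.14)

Mean Perfection Generation, $\overline{pt}$
The average number of generations/iterations t needed by the (perfect) experimental runs to find a perfect individual (or $\emptyset$ if no run found one).
$\overline{pt} = \begin{cases} \frac{1}{\#p} \sum_{pt_i \neq \emptyset} pt_i & \text{if } \exists\, pt_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.15)

Maximum Perfection Generation, $\hat{pt}$
The number of generations/iterations t needed by the slowest (perfect) experimental run to find a perfect individual (or $\emptyset$ if no run found one).
$\hat{pt} = \begin{cases} \max\{pt_i \neq \emptyset\} & \text{if } \exists\, pt_i \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases}$ (20.16)

Mean Runtime, $\overline{rT}$
The arithmetic mean of the runtime consumed by the single runs.
$\overline{rT} = \frac{1}{\#r} \sum_{i=0}^{\#r-1} rT_i$ (20.17)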

Table 20.5: Simple evaluation results. 
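
The measures of Table 20.5 all follow the same pattern: filter out the runs that never reached the event, then count, average, and take the extremes of what remains. The following is a minimal sketch of that pattern, not part of the original text. It assumes that each run's measurement is stored in a Python list and that `None` stands in for the sign $\emptyset$; the function name `simple_measures` and the sample numbers are purely illustrative.

```python
from statistics import mean


def simple_measures(values):
    """Summarize one per-run quantity, e.g. the success evaluations.

    `values` holds one entry per run. None plays the role of the
    empty-set sign: the run never reached the event (assumption).
    """
    hits = [v for v in values if v is not None]
    if not hits:
        # No run reached the event: the count is 0, the rest is undefined.
        return {"count": 0, "fraction": 0.0,
                "min": None, "mean": None, "max": None}
    return {
        "count": len(hits),                  # e.g. #s
        "fraction": len(hits) / len(values),  # e.g. s/r
        "min": min(hits),
        "mean": mean(hits),
        "max": max(hits),
    }


# Illustrative data: five runs, two of which never found a success.
success_evals = [1200, None, 800, 2000, None]
summary = simple_measures(success_evals)
```

The same function can be applied unchanged to the success generations, the perfection evaluations, and the perfection generations, since all of them share the "value or $\emptyset$" structure.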

20.3.2 Sophisticated Estimates 

The average generation of success $\overline{st}$ is an estimate of the expected number of iterations
needed by the optimizer to find a solution to the optimization problem specified. From 
the measurements taken during the experiments, however, we can also extract some more 
sophisticated estimates which give us more information about the optimization process. 

Cumulative Probability of Success etc. 

One of these estimates is the cumulative probability of success $CPs(ps, t')$ introduced by Koza
[1196]. It approximates the probability that a population-based optimization algorithm with
a population size ps solves a given problem until iteration (generation) $t'$. Basically, it can
easily be estimated from the experimental data as follows:

$CPs(ps, t') = \frac{|\{st_i : st_i \le t'\}|}{\#r}$ (20.18)

The probability of solving the problem until the $t'$-th iteration at least once in $st_n$ independent
runs then becomes approximately $1 - (1 - CPs(ps, t'))^{st_n}$. If we want to find out
how many runs with $t'$ iterations we need to solve the problem with probability z, we have
to solve $z = 1 - (1 - CPs(ps, t'))^{st_n}$. This equation can be solved and we obtain the function
$st_n(z, ps, t')$:

$st_n(z, ps, t') = \begin{cases} \left\lceil \frac{\log(1 - z)}{\log(1 - CPs(ps, t'))} \right\rceil & \text{if } 0 < CPs(ps, t') < z \\ 1 & \text{if } CPs(ps, t') \ge z \\ +\infty & \text{otherwise} \end{cases}$ (20.19)

From this value, we can directly compute an estimate of the number of objective function
evaluations $s\tau_n(z, ps, t')$ needed (i.e., the individuals to be processed) to find a solution with
probability z if $st_n(z, ps, t')$ independent runs proceed up to $t'$ iterations. If we have a
generational evolutionary algorithm (i. e., ss = 1), this would be $st_n(z, ps, t') \cdot ps \cdot t'$. In
a steady-state algorithm where all ps parents compete with a total of ps offspring and tc
training cases are used that change each generation (ct = 1), we would require $2 \cdot ps \cdot
tc \cdot st_n(z, ps, t') \cdot t'$ evaluations. If the training cases were constant, we would not need to
re-evaluate the parents in order to determine their objective values, and $s\tau_n(z, ps, t')$ would
become $ps \cdot tc \cdot st_n(z, ps, t') \cdot t'$.

Obviously, how the value of $s\tau_n(z, ps, t')$ is computed strongly depends on the optimization
algorithm applied and may be different from case to case. For the sake of simplicity, we
assume that even if we have |F| > 1 objective functions, all of them can be evaluated in one
single evaluation step together if not explicitly stated otherwise.
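
The chain from Equation 20.18 to the evaluation estimate can be traced with a few lines of code. The sketch below is an illustration only: the function names, the use of `None` for runs without success, and the sample data are assumptions, and the evaluation count covers just the simplest case discussed above (number of runs times ps times $t'$). It also assumes 0 < z < 1.

```python
import math


def cumulative_success_probability(success_generations, t_max):
    """Estimate CPs: share of runs that succeeded by generation t_max."""
    hits = sum(1 for t in success_generations
               if t is not None and t <= t_max)
    return hits / len(success_generations)


def runs_needed(cp, z):
    """Independent runs needed to succeed at least once with probability z."""
    if cp <= 0.0:
        return math.inf        # no run ever succeeded by t_max
    if cp >= z:
        return 1               # a single run is already good enough
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - cp))


def evaluations_needed(cp, z, ps, t_max):
    """Simplest case from the text: runs * population size * iterations."""
    return runs_needed(cp, z) * ps * t_max


# Illustrative data: ten runs, None marks runs without a success.
success_generations = [12, None, 40, 7, None, 25, None, None, 31, 18]
cp = cumulative_success_probability(success_generations, t_max=30)
runs = runs_needed(cp, z=0.99)
evals = evaluations_needed(cp, z=0.99, ps=100, t_max=30)
```

With these sample values, four of ten runs succeed within 30 generations, so nine runs would still leave a failure probability slightly above 1% and a tenth run is required.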

Since we often distinguish successful and perfect solutions in this book, we can easily
derive the estimates $CPp$, $pt_n$, and $p\tau_n$ analogously to $CPs$, $st_n$, and $s\tau_n$:

$CPp(ps, t') = \frac{|\{pt_i : pt_i \le t'\}|}{\#r}$ (20.20)

$pt_n(z, ps, t') = \begin{cases} \left\lceil \frac{\log(1 - z)}{\log(1 - CPp(ps, t'))} \right\rceil & \text{if } 0 < CPp(ps, t') < z \\ 1 & \text{if } CPp(ps, t') \ge z \\ +\infty & \text{otherwise} \end{cases}$ (20.21)



20.4 Verification

When obtaining measures like the mean number of individual evaluations $\overline{s\tau}$ needed to solve
a given problem for multiple optimizers or for several configurations of the same optimization
algorithm, one would tend to say that an algorithm/configuration A with $\overline{s\tau}(A) < \overline{s\tau}(B)$ is
better than an algorithm/configuration B. Such a statement should never be made without
further discussion and statistical foundation. Never forget that measures or evaluation results
obtained from experiments are always estimates 3 , i. e., guesses on the real parameter of an
unknown probability distribution driving the process (optimization algorithm) which we
have sampled with our measurements. An estimate should never be considered to be more
than a direction, a pointer to an area, where the real values of the parameters are.

3 Some introduction on estimation theory can be found in Section 28.7, by the way.

20.4.1 Confidence Intervals or Statistical Tests

So, instead of defining the mean number of evaluations to success as a single number,
we could instead compute a confidence interval (see Section 28.7.3). A confidence interval
specifies boundaries inside which the true value of the estimated quantity is located
with a certain probability. Using a bit more math, we could derive an interval like
$P(1000 \le E[s\tau(A)] \le 3000) \ge 90\%$, which is far more meaningful than just stating that
$\overline{s\tau}(A) = 2000$. If the upper limit of the confidence interval of $E[s\tau(A)]$ is below the lower
limit of the confidence interval for $E[s\tau(B)]$, it would indeed be justified to say that algorithm
A performs better than B.

Computing the conventional confidence intervals discussed in Section 28.7.3 has a drawback
when it comes to experiments. If you look up the examples there, you will find that all
equations there assume that the measured quantity has one of the well-known probability
distributions. In other words, for deriving the aforementioned interval for A, we would have
to assume that the $s\tau(A)$ are normally distributed, for instance. Of course we do not know if
this is the case, and normally we cannot know. More often, we even have strong evidence
that such an assumption would be rank nonsense. A normal distribution is a continuous
distribution which stretches to infinity in both directions. Even if we ignore that $s\tau$ surely is
not a continuous but a discrete quantity, it will definitely never be negative. Furthermore, if
the optimization algorithm used for the experiments is population-oriented, $s\tau$ is often
considered to be a multiple of the population size (see Section 20.3.2, for example). Computing
a confidence interval using obviously wrong or even unverified/unverifiable assumptions is
useless, wrong, and misleading.

This brings us back to square one, to the quantities which we have derived with the
previously discussed evaluation methods. But there exists another way to check whether
$\overline{s\tau}(A) < \overline{s\tau}(B)$ is significant or if this result is rather likely to be coincidence: statistical
testing. Statistical tests are briefly discussed in Section 28.8 in this book. The idea of testing
is that first, a so-called null hypothesis $H_0$ is defined which states "$E[s\tau(A)]$ and $E[s\tau(B)]$
are equal." The alternative hypothesis $H_1$ would be "$E[s\tau(A)]$ and $E[s\tau(B)]$ are not equal."
The goal is to find out whether or not there is enough statistical evidence to reject $H_0$ and
to accept $H_1$ with a certain, small probability p of error. Of course, most of the hypothesis
tests have the same problem as the conventional confidence intervals from Section 28.7.3:
they assume certain probability distributions driving the measurements. The set of non-parametric
tests discussed in Section 28.8.1 works without such assumptions.

Thus, whenever we have insufficient information about the distribution of the samples, 
these tests are the method of choice for checking if experimental results indeed carry some 
meaning or not. Nevertheless, it is very important to realize that even these tests have 
certain requirements which must not be ignored. 

Interestingly, many statistical tests can be inverted and used to compute confidence 
intervals as noted in Section 28.7.3 which closes the circle of this section. 
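
To make the idea of a distribution-free test concrete, here is a small sketch of a two-sided sign test for paired measurements. It is an illustration under stated assumptions, not the procedure of Section 28.8.1: the function name `sign_test` is hypothetical, ties are simply dropped, and the returned value is the probability of observing a split at least as lopsided as the measured one if "A better" and "B better" were equally likely.

```python
from math import comb


def sign_test(a, b):
    """Two-sided sign test for paired samples a[i], b[i].

    Ties are dropped (assumption). The result is the probability of a
    split at least as one-sided as the observed one under the null
    hypothesis that both signs are equally likely.
    """
    wins = sum(1 for x, y in zip(a, b) if x < y)
    losses = sum(1 for x, y in zip(a, b) if x > y)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # One tail of the binomial distribution B(n, 1/2), then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Illustrative data: A needs fewer evaluations than B in all 8 pairs.
evals_a = [900, 1100, 1500, 800, 1200, 1000, 950, 1300]
evals_b = [1400, 1600, 1550, 1900, 1250, 2100, 1000, 1800]
p_value = sign_test(evals_a, evals_b)
```

The test uses only the direction of each pairwise difference, which is why it needs no assumption about the shape of the underlying distribution. It still requires the pairs to be independent of each other.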

20.4.2 Factorial Experiments 

If we have an experiment where multiple parameter configurations are tested, i.e., a factorial 
experiment [263, 2288, 681], we often want to find two things: 

1. the best possible configuration, and 

2. which settings of one parameter are (on average) good and which are bad.

Notice that a configuration consisting only of the worst possible settings of all parameters 
can still be the best configuration possible - if the parameter settings interact. In Section 20.1 
we have already mentioned that this is often the case in optimization algorithms such as 
evolutionary algorithms. On the other hand, knowing general trends for certain parameters 
is valuable too. Obviously, the best observed parameter configuration is the one with the
best mean or median performance. Whether it is significantly better than the others needs to be
tested.

Finding out whether a certain parameter configuration is good or not is relatively easy
in factorial experiments. Assume we have run an experiment with the five parameters population
size (either ps = 512 or ps = 1024), convergence prevention (cp = 0 or cp = 0.3),
steady-state or generational populations (ss = 1 or ss = 0), mutation rates mr = 3% and
mr = 15%, and crossover rates cr = 60% and cr = 80%. In the case of a full factorial
experiment, we would thus test $2^5 = 32$ different configurations. For each configuration,
mxr experimental runs are performed. Assume furthermore that we are interested in the
estimate p/r of the probability of finding a perfect solution in a single run and the expected
number of individual evaluations $p\tau_n(z, ps, t')$ needed to find such a perfect solution with
probability z (if runs of the evolutionary algorithm were performed with population size ps
up to $t'$ iterations).

After all 32 * mxr runs, we would compute these measures for each single configuration.
The influence of the two population size settings on the perfection rate p/r can then be
estimated by dividing the tested configurations into two groups, those with ps = 1024 and
those with ps = 512. For both groups, the arithmetic mean $\overline{p/r}$ and the median med(p/r) are
computed separately and compared. In Table 20.6, we see that the mean and the median
of the configurations with 1024 individuals in the population are higher than for those with
ps = 512. As one would expect, the larger population has a higher chance of completing
a run by finding a correct solution than the smaller population. Now we test
this trend with the mentioned non-parametric tests 4 by pairwise grouping the results of the
runs which have exactly the same configurations except for their population size settings.
In the example Table 20.6, we have 16 such pairs and find that all three tests agree on the
significance of the result.



ps = 1024 vs. ps = 512 (based on 16 samples)

Test according to p/r (higher is better)

Sign test (see Section 28.8.1): $\mathrm{med}(p/r)|_{ps=1024} = 0.19$, $\mathrm{med}(p/r)|_{ps=512} = 0.09$,
$\alpha \approx 0.0063$ ⇒ significant at level $\alpha = 0.01$

Randomization test (see Section 28.8.1): $\overline{p/r}|_{ps=1024} \approx 0.06$, $\overline{p/r}|_{ps=512} = 0.0$,
$\alpha \approx 0.0024$ ⇒ significant at level $\alpha = 0.01$

Signed rank test (see Section 28.8.1): $R(p/r)|_{ps:1024-512} = 114.0$,
$\alpha \approx 0.019$ ⇒ not significant at level $\alpha = 0.01$

Test according to $p\tau_n$ (lower is better)

Sign test (see Section 28.8.1): $\mathrm{med}(p\tau_n)|_{ps=1024}$ = 1.66 • 10 s , $\mathrm{med}(p\tau_n)|_{ps=512} = +\infty$,
$\alpha \approx 0.1940$ ⇒ not significant at level $\alpha = 0.01$

Randomization test (see Section 28.8.1): $\overline{p\tau_n}|_{ps=1024} = \overline{p\tau_n}|_{ps=512} = +\infty$,
could not be applied

Signed rank test (see Section 28.8.1): $R(p\tau_n)|_{ps:1024-512} = -94.0$,
$\alpha \approx 0.0601$ ⇒ not significant at level $\alpha = 0.01$

Table 20.6: ps = 1024 vs. ps = 512 (based on 16 samples) 



For $p\tau_n$, we can also find differences between the two groups. However, as it (could have)
turned out, multiple configurations were not able to yield a solution in any of the runs (i. e.,
have p/r = 0) and thus, their $p\tau_n$ becomes infinite. Due to this, the randomization
test could not be applied. Besides the numerical problems here, another reason why
some of the tests cannot be used would be if we had too many samples, for instance (see the
discussion of the randomization test in Section 28.8.1). Although the larger population size
again seems to be better in the sample, the tests show that there is not enough evidence to
support this expectation and that the result could have shown up coincidentally. A more
thorough example for this approach can be found in Section 21.3.2 on page 366.
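
The grouping step used in this section, building all 32 configurations and then pairing those that differ only in the population size, can be sketched as follows. This is an illustrative sketch only: the dictionary layout and the helper `pair_by` are assumptions introduced here, and the parameter levels are the ones from the example above.

```python
from itertools import product

# The five two-level factors from the example experiment.
levels = {
    "ps": (512, 1024),
    "cp": (0.0, 0.3),
    "ss": (0, 1),
    "mr": (0.03, 0.15),
    "cr": (0.60, 0.80),
}
names = list(levels)
# Full factorial design: every combination of the levels.
configs = [dict(zip(names, combo)) for combo in product(*levels.values())]


def pair_by(configs, factor, low, high):
    """Pair configurations that are identical except for `factor`."""
    def rest(c):
        return tuple((k, v) for k, v in c.items() if k != factor)

    lows = {rest(c): c for c in configs if c[factor] == low}
    highs = {rest(c): c for c in configs if c[factor] == high}
    return [(lows[key], highs[key]) for key in lows if key in highs]


pairs = pair_by(configs, "ps", 512, 1024)
```

Each resulting pair supplies one paired sample (for instance the two p/r values) to a paired non-parametric test, which gives the 16 samples mentioned for Table 20.6.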



4 We can assume that both p/r and $p\tau_n$ are continuous quantities for large max.



Benchmarks and Toy Problems



In this chapter, we discuss some benchmark and toy problems which are used to demonstrate 
the utility of global optimization algorithms. Such problems have usually no direct real-world 
application but are well understood, widely researched, and can be used 

1. to measure the speed and the ability of optimization algorithms to find good solutions,
as done for example by Luke and Panait [1319], Borgulya [250], and Liu et al.
[1295],

2. as benchmark for comparing different optimization approaches, as done by Zitzler et al. 
[2330] and Purshouse and Fleming [1679], for instance, 

3. to derive theoretical results since they are normally well understood in a mathematical 
sense, as done for example by Jansen and Wegener [1039], 

4. as basis to verify theories, as used for instance by Burke et al. [308] and Langdon and 
Poli [1241], 

5. as playground to test new ideas, research, and developments, 

6. as easy-to-understand examples to discuss features and problems of optimization (as 
done here in Section 1.2.2 on page 27), 

7. for demonstration purposes, since they normally are interesting, funny, and can be vi- 
sualized in a nice manner. 



21.1 Real Problem Spaces 

Mathematical benchmark functions are especially interesting for testing and comparing techniques
based on real vectors ($X \subseteq \mathbb{R}^n$) like plain Evolution Strategy (see Chapter 5 on
page 227), Differential Evolution (see Section 5.5 on page 229), and Particle Swarm Optimization
(see Chapter 9 on page 249). However, they only require such vectors as solution
candidates, i. e., elements of the problem space X. Hence, techniques with different search 
spaces G, like genetic algorithms, can also be applied to them, given that a genotype- 
phenotype mapping is provided accordingly. 

The optima or the Pareto frontier of benchmark functions has already been determined 
theoretically. When applying an optimization algorithm to the functions, we are interested 
in the number of solution candidates which they need to process in order to find the optima 
and how close we can get to them. They also give us a great opportunity to find out about 
the influence of parameters like population size, the choice of the selection algorithm, or the 
efficiency of reproduction operations. 

21.1.1 Single-Objective Optimization 

In this section, we list some of the most important benchmark functions for scenarios involving
only a single optimization criterion. This, however, does not mean that the search
space has only a single dimension - even a single-objective optimization can take place in
n-dimensional space $\mathbb{R}^n$.

Sphere 

The sphere function listed by Suganthan et al. [1979] (or $F_1$ by De Jong [512]) and defined
here in Table 21.1 is a very simple measure of efficiency of optimization methods. It has,
for instance, been used by Rechenberg [1713] for testing his Evolution Strategy approach.



function      $f_{\text{sphere}}(x) = \sum_{i=1}^{n} x_i^2$      (21.1)
domain        $X \subseteq \mathbb{R}^n$, $x_i \in [-10, 10]$      (21.2)
optimum       $x^\star = (0, 0, .., 0)^T$      (21.3)
separable     yes
multimodal    no


Table 21.1: The Sphere function. 
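
As a quick illustration of Table 21.1, the sphere function takes only a few lines. The sketch below is not from the original text; the helper `random_point` and the seed are assumptions added to show how a point from the stated domain could be sampled.

```python
import random


def sphere(x):
    """Sum of squares; minimal (zero) at the origin."""
    return sum(xi * xi for xi in x)


def random_point(n, rng, low=-10.0, high=10.0):
    """Draw a point uniformly from the box [low, high]^n (illustrative)."""
    return [rng.uniform(low, high) for _ in range(n)]


rng = random.Random(42)
x = random_point(5, rng)
value = sphere(x)            # never negative
best = sphere([0.0] * 5)     # value at the optimum
```

Because each term depends on a single coordinate, the coordinates can be optimized independently of one another, which is what "separable" means in the table.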



TODO 



21.1.2 Multi-Objective Optimization 

In this section, we list some of the most important benchmark functions for scenarios in- 
volving multiple objectives (see Section 1.2.2 on page 27). A comprehensive review on such 
problems is given by Huband et al. [972]. Other multi-objective problems can be found in 
[546]. 

TODO 



21.1.3 Dynamic Fitness Landscapes 

The moving peaks benchmarks were independently developed by Branke [277, 278] and Morrison
and De Jong [1465] in order to illustrate the behavior of dynamic environments as discussed
in Section 1.4.9 on page 76. Figure 21.1 shows an example of this benchmark for a two-dimensional
real parameter setting (the third dimension is the fitness).

Fig. 21.1.g: t = 7   Fig. 21.1.h: t = 13

Figure 21.1: An example for the moving peaks benchmark of Branke [277, 278] 



21.2 Binary Problem Spaces 

21.2.1 Kauffman's NK Fitness Landscapes 

The ideas of fitness landscapes 1 and epistasis 2 came originally from evolutionary biology 
and later were adopted by evolutionary computation theorists. It is thus not surprising that 
biologists also contributed much to the research of both. In the late 1980s, Kauffman [1098] 
defined the NK fitness landscape [1100, 1098, 1101], a family of objective functions with 
tunable epistasis, in an effort to investigate the links between epistasis and ruggedness. 

The problem space and also the search space of this problem are bit strings of the length
N, i.e., $G = X = \mathbb{B}^N$. Only one single objective function is used and referred to as fitness
function $F_{N,K}: \mathbb{B}^N \mapsto \mathbb{R}^+$. Each gene $x_i$ contributes one value $f_i: \mathbb{B}^{K+1} \mapsto [0, 1] \subset \mathbb{R}^+$
to the fitness function which is defined as the average of all of these N contributions. The
fitness $f_i$ of a gene $x_i$ is determined by its allele and the alleles at K other loci $x_{i_1}, x_{i_2}, .., x_{i_K}$
with $i_{1..K} \in [0, N-1] \setminus \{i\} \subset \mathbb{N}_0$, called its neighbors.

$F_{N,K}(x) = \frac{1}{N} \sum_{i=0}^{N-1} f_i(x_i, x_{i_1}, x_{i_2}, .., x_{i_K})$ (21.4)

1 Fitness landscapes have been introduced in Section 1.3.2 on page 47.

2 Epistasis is discussed in Section 1.4.6 on page 68.

Whenever the value of a gene changes, all the fitness values of the genes to whose neighbor 
set it belongs will change too - to values uncorrelated to their previous state. While N 
describes the basic problem complexity, the intensity of this epistatic effect can be controlled 
with the parameter $K \in 0..N-1$: If K = 0, there is no epistasis at all, but for K = N − 1 the
epistasis is maximized and the fitness contribution of each gene depends on all other genes. 
Two different models are defined for choosing the K neighbors: adjacent neighbors, where 
the K nearest other genes influence the fitness of a gene or random neighbors where K other 
genes are therefore randomly chosen. 

The single functions $f_i$ can be implemented by a table of length $2^{K+1}$ which is indexed
by the (binary encoded) number represented by the gene $x_i$ and its neighbors. These tables
contain a fitness value for each possible value of a gene and its neighbors. They can be filled
by sampling a uniform distribution in [0, 1) (or any other random distribution).
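
The table-based implementation described above can be sketched as follows. This is a minimal sketch, not a reference implementation: the function names are hypothetical, and the "adjacent" model is assumed here to mean the K loci following gene i with wrap-around at the end of the string, which is only one of several possible readings of "nearest".

```python
import random


def make_nk(n, k, rng, adjacent=True):
    """Build neighbor lists and contribution tables for an NK landscape."""
    neighbors = []
    for i in range(n):
        if adjacent:
            # Assumption: the k loci following i, wrapping around.
            others = [(i + d) % n for d in range(1, k + 1)]
        else:
            # Random neighbors: k distinct loci other than i.
            others = rng.sample([j for j in range(n) if j != i], k)
        neighbors.append(others)
    # One table of 2^(k+1) uniform random entries per gene.
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]
    return neighbors, tables


def nk_fitness(bits, neighbors, tables):
    """Average of the N table entries selected by each gene and its neighbors."""
    total = 0.0
    for i, table in enumerate(tables):
        index = bits[i]
        for j in neighbors[i]:
            index = (index << 1) | bits[j]   # binary-encode gene + neighbors
        total += table[index]
    return total / len(tables)


rng = random.Random(1)
neighbors, tables = make_nk(8, 2, rng)
genotype = [rng.randint(0, 1) for _ in range(8)]
fitness = nk_fitness(genotype, neighbors, tables)
```

Flipping one bit changes the table index of that gene and of every gene that lists it as a neighbor, which is how larger K produces more rugged landscapes.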

We may also consider the $f_i$ to be single objective functions that are combined to a
fitness value $F_{N,K}$ by a weighted sum approach, as discussed in Section 1.2.2. Then, the
nature of NK problems will probably lead to another well known aspect of multi-objective 
optimization: conflicting criteria. An improvement in one objective may very well lead to 
degeneration in another one. 

The properties of the NK landscapes have intensely been studied in the past and the 
most significant results from Kauffman [1099], Weinberger [2170], and Fontana et al. [721] 
will be discussed here. We therefore borrow from the summaries provided by Altenberg [44] 
and Defoin Platel et al. [549]. Further information can be found in [2258, 769, 393, 392]. An 
analysis of the behavior of estimation of distribution algorithms and genetic algorithms in 
NK landscapes has been provided by Pelikan [1632]. 



K = 0

For K = 0, the fitness function is not epistatic. Hence, all genes can be optimized separately
and we have the classical additive multi-locus model.

1. There is a single optimum $x^\star$ which is globally attractive, i. e., which can and will be
found by any (reasonable) optimization process regardless of the initial configuration.

2. For each individual $x \neq x^\star$, there exists a fitter neighbor.

3. An adaptive walk 3 from any point in the search space will proceed by reducing the
Hamming distance to the global optimum by 1 in each step (if each mutation only
affects one single gene). The number of better neighbors equals the Hamming distance
to the global optimum. Hence, the estimated number of steps of such a walk is $\frac{N}{2}$.

4. The fitness of direct neighbors is highly correlated since it shares N − 1 components.

K = N - 1

For K = N − 1, the fitness function equals a random assignment of fitness to each point of
the search space.

1. The probability that a genotype is a local optimum is $\frac{1}{N+1}$.

2. The expected total number of local optima is thus $\frac{2^N}{N+1}$.

3. The average distance between local optima is approximately $2 \ln(N-1)$.

4. The expected length of adaptive walks is approximately $\ln(N-1)$.

5. The expected number of mutants to be tested in an adaptive walk before reaching a
local optimum is $\sum_{i=0}^{\log_2(N-1)-1} 2^i$.

6. With increasing N, the expected fitness of local optima reached by an adaptive walk from a
random initial configuration decreases towards the mean fitness $\overline{F_{N,K}} = \frac{1}{2}$ of the search
space. This is called the complexity catastrophe [1099].

For K = N − 1, the work of Flyvbjerg and Lautrup [692] is of further interest.

3 See Section 17.4.3 on page 297 for a discussion of adaptive walks.



Intermediate K 



For small K, the best local optima share many common alleles. As K increases, this cor- 
relation diminishes. This degeneration proceeds faster for the random neighbors method 
than for the nearest neighbors approach. 

For larger K, the fitness of the local optima approaches a normal distribution with mean
m and variance s approximately

$m = \mu + \sigma \sqrt{\frac{2 \ln(K+1)}{K+1}}$ (21.5)

$s = \frac{(K+1)\,\sigma^2}{N \left(K + 1 + 2(K+2) \ln(K+1)\right)}$ (21.6)

where $\mu$ is the expected value of the $f_i$ and $\sigma^2$ is their variance.

The mean distance between local optima, roughly twice the length of an adaptive walk,
is approximately $\frac{N \log_2(K+1)}{2(K+1)}$.

The autocorrelation function 4 $\rho(k, F_{N,K})$ and the correlation length $\tau$ are:

$\rho(k, F_{N,K}) = \left(1 - \frac{K+1}{N}\right)^k$ (21.7)

$\tau = \frac{-1}{\ln\left(1 - \frac{K+1}{N}\right)}$ (21.8)


Computational Complexity 

Altenberg [44] nicely summarizes the four most important theorems about the computational
complexity of optimization of NK fitness landscapes. These theorems have been proven using
different algorithms introduced by Weinberger [2171] and Thompson and Wright [2040].

1. The NK optimization problem with adjacent neighbors is solvable in $O(2^K N)$ steps and
thus in $\mathcal{P}$ [2171].

2. The NK optimization problem with random neighbors is $\mathcal{NP}$-complete for $K \ge 2$ [2171,
2040].

3. The NK optimization problem with random neighbors and K = 1 is solvable in polynomial
time [2040].



Adding Neutrality NKp, NKq, and Technological Landscapes 

As we have discussed in Section 1.4.5, natural genomes exhibit a certain degree of neutrality.
Therefore, researchers have proposed extensions for the NK landscape which introduce neutrality,
too [776, 777]. Two of them, the NKp and NKq landscapes, achieve this by altering
the contributions $f_i$ of the single genes to the total fitness. In the following, assume that
there are N tables, each with $2^{K+1}$ entries representing these contributions.



4 See Definition 1.48 on page 63 for more information on autocorrelation.

The NKp landscapes devised by Barnett [149] achieve neutrality by setting a certain
number of entries in each table to zero. Hence, the corresponding allele combinations do not
contribute to the overall fitness of an individual. If a mutation leads to a transition from one
such zero configuration to another one, it is effectively neutral. The parameter p denotes the
probability that a certain allele combination does not contribute to the fitness. As proven by
Reidys and Stadler [1718], the ruggedness of the NKp landscape does not vary for different
values of p. Barnett [148] proved that the degree of neutrality in this landscape depends on
p.

Newman and Engelhardt [1521] follow a similar approach with their NKq model. Here,
the fitness contributions $f_i$ are integers drawn from the range [0, q) and the total fitness of
a solution candidate is normalized by multiplying it with $\frac{1}{q-1}$. A mutation is neutral when
the new allelic combination resulting from it leads to the same contribution as the old
one. In NKq landscapes, the neutrality decreases with rising values of q. In [777], you can
find a thorough discussion of the NK, the NKp, and the NKq fitness landscape.
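
Both neutral variants differ from the plain NK landscape only in how the contribution tables are filled. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the non-zero NKp entries are assumed to be uniform in [0, 1), and the NKq normalization is applied per table entry, which gives the same total as scaling the averaged fitness by $\frac{1}{q-1}$. It assumes q ≥ 2.

```python
import random


def nkp_table(size, p, rng):
    """NKp: each entry is zero with probability p, else uniform in [0, 1)."""
    return [0.0 if rng.random() < p else rng.random() for _ in range(size)]


def nkq_table(size, q, rng):
    """NKq: integer contributions from [0, q), scaled into [0, 1]."""
    return [rng.randrange(q) / (q - 1) for _ in range(size)]


rng = random.Random(7)
k = 2
table_p = nkp_table(2 ** (k + 1), p=0.5, rng=rng)
table_q = nkq_table(2 ** (k + 1), q=4, rng=rng)
```

With small q, many table entries coincide, so a mutation often selects an entry with the same value as before. This is the source of neutrality in the NKq model and the reason it fades as q grows.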

With their technological landscapes, Lobo et al. [1300] follow the same approach from
the other side: they discretize the continuous total fitness function $F_{N,K}$. The parameter M
of their technological landscapes corresponds to a number of bins $[0, \frac{1}{M}), [\frac{1}{M}, \frac{2}{M}), \ldots$,
into which the fitness values are sorted.

21.2.2 The p-Spin Model 

Motivated by the wish of researching the models for the origin of biological information by 
Anderson [51, 1747] and Tsallis and Ferreira [2056], Amitrano et al. [48] developed the p- 
spin model. This model is an alternative to the NK fitness landscape for tunable ruggedness 
[2172]. Other than the previous models, it includes a complete definition for all genetic 
operations which will be discussed in this section. 

The p-spin model works with a fixed population size ps of individuals of an also fixed
length N. There is no distinction between genotypes and phenotype, in other words, G = X.
Each gene of an individual x is a binary variable which can take on the values −1 and 1.

$G = \{-1, 1\}^N, \quad x_i[j] \in \{-1, 1\} \;\; \forall i \in [1..ps], j \in [0..N-1]$ (21.9)

On the $2^N$ possible genotypic configurations, a space with the topology of an N-dimensional
hypercube is defined where neighboring individuals differ in exactly one element. On this
genome, the Hamming distance $\mathrm{dist}_{Ham}$ can be defined as

$\mathrm{dist}_{Ham}(x_1, x_2) = \frac{1}{2} \sum_{i=0}^{N-1} \left(1 - x_1[i]\, x_2[i]\right)$ (21.10)

Two configurations are said to be $\nu$ mutations away from each other if they have the Hamming
distance $\nu$. Mutation in this model is applied to a fraction of the N * ps genes in the
population. These genes are chosen randomly and their state is changed, i. e., $x[i] \to -x[i]$.

The objective function $f_K$ (which is called fitness function in this context) is subject to
maximization, i. e., individuals with a higher value of $f_K$ are less likely to be removed from
the population. For every subset z of exactly K genes of a genotype, one contribution A(z)
is added to the fitness.

$f_K(x) = \sum_{\forall z \in \mathcal{P}([0..N-1]) \wedge |z| = K} A(x[z[0]], x[z[1]], \ldots, x[z[K-1]])$ (21.11)

$A(z) = a_z \cdot z[0] \cdot z[1] \cdots z[K-1]$ is the product of an evenly distributed random number $a_z$
and the elements of z. For K = 2, $f_2$ can be written as $f_2(x) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{ij}\, x[i]\, x[j]$, which
corresponds to the spin-glass [208, 1402] function first mentioned by Anderson [51] in this
context. With rising values of K, this fitness landscape becomes more rugged. Its correlation
length $\tau$ is approximately $\frac{N}{2K}$, as discussed thoroughly by Weinberger and Stadler [2172].

For selection, Amitrano et al. [48] suggest to use the measure $P_d(x)$ defined by Rokhsar
et al. [1747] as follows:

$P_d(x) = \frac{1}{1 + e^{\beta (f_K(x) - H)}}$ (21.12)

where the coefficient $\beta$ is a sharpness parameter and H is a threshold value. For $\beta \to \infty$,
all individuals x with $f_K(x) < H$ will die and for $\beta = 0$, the death probability is always $\frac{1}{2}$.
The individuals which have died are then replaced with copies of the survivors.
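
The pieces of the p-spin model described in this section (Hamming distance on ±1 strings, the subset-based fitness, and the death probability) can be put together in a short sketch. It is illustrative only: the function names are hypothetical, and the coefficients $a_z$ are assumed to be uniform in [−1, 1], since the text only calls them "evenly distributed". `math.prod` requires Python 3.8 or later.

```python
import itertools
import math
import random


def make_couplings(n, k, rng):
    """One random coefficient a_z per k-subset z of the gene indices."""
    return {z: rng.uniform(-1.0, 1.0)   # range is an assumption
            for z in itertools.combinations(range(n), k)}


def p_spin_fitness(x, couplings):
    """Sum over all subsets z of a_z times the product of the genes in z."""
    return sum(a * math.prod(x[i] for i in z) for z, a in couplings.items())


def hamming(x1, x2):
    """Hamming distance for strings over {-1, 1}."""
    return sum(1 - a * b for a, b in zip(x1, x2)) // 2


def death_probability(f, beta, threshold):
    """Logistic death probability; beta controls the sharpness."""
    return 1.0 / (1.0 + math.exp(beta * (f - threshold)))


rng = random.Random(3)
couplings = make_couplings(6, 2, rng)
x = [rng.choice((-1, 1)) for _ in range(6)]
flipped = [-v for v in x]
f_x = p_spin_fitness(x, couplings)
f_flipped = p_spin_fitness(flipped, couplings)
```

For even K, negating every gene leaves each product unchanged, so `f_x` and `f_flipped` coincide. Very large values of `beta * (f - threshold)` overflow `math.exp`, so a production version would need to clamp the exponent.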



21.2.3 The ND Family of Fitness Landscapes 

The ND family of fitness landscapes has been developed by Beaudoin et al. [161] in order to
provide a model problem with tunable neutrality.

The degree of neutrality ν is defined as the number (or, better, the fraction) of neutral
neighbors (i.e., those with the same fitness) of a solution candidate, as specified in Equation 1.42
on page 64. The populations of optimization processes residing on a neutral network
(see Section 1.4.5 on page 66) tend to converge into the direction of the individual which has
the highest degree of neutrality on it. Therefore, Beaudoin et al. [161] create a landscape
with a predefined neutral degree distribution.

The search space is again the set of all binary strings of the length N, G = X = 𝔹^N.
Thus, a genotype has minimally 0 and at most N neighbors with Hamming distance 1 that
have the same fitness. The array D has the length N + 1 and the element D[i] represents
the fraction of genotypes in the population that have i neutral neighbors.

Beaudoin et al. [161] provide an algorithm that divides the search space into neutral
networks according to the values in D. Since this approach cannot exactly realize the
distribution defined by D, the degrees of neutrality of the single individuals are subsequently
refined with a Simulated Annealing algorithm. The objective (fitness) function is created in
form of a complete table mapping X ↦ ℝ. All members of a neutral network then receive
the same, random fitness.

If it is ensured that all members in a neutral network always have the same fitness, 
its actual value can be modified without changing the topology of the network. Tunable 
deceptiveness is achieved by setting the fitness values according to the Trap Functions [540, 
12, 1069]. 



Trap Functions 

Trap functions f_{b,r,x*} : 𝔹^N ↦ ℝ are subject to maximization based on the Hamming distance
to a pre-defined global optimum x*. They build a second, local optimum in form of a hill
with a gradient pointing away from the global optimum. This trap is parameterized with
two values, b and r, where b corresponds to the width of the attractive basins and r to their
relative importance.

f_{b,r,x^*}(x) = \begin{cases} 1 - \frac{dist_{Ham}(x, x^*)}{N b} & \text{if } \frac{1}{N} dist_{Ham}(x, x^*) < b \\ \frac{r \left( \frac{1}{N} dist_{Ham}(x, x^*) - b \right)}{1 - b} & \text{otherwise} \end{cases}    (21.13)

Equation 21.14 shows a similar "Trap" function defined by Ackley [12] where u(x) is the
number of ones in the bit string x of length n and z = ⌊3n/4⌋ [1069]. The objective function
f(x), which is subject to maximization, is sketched in Figure 21.2.

f(x) = \begin{cases} (8n/z)(z - u(x)) & \text{if } u(x) < z \\ (10n/(n-z))(u(x) - z) & \text{otherwise} \end{cases}    (21.14)







Figure 21.2: Ackley's "Trap" function [12, 1069]. 
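
Both trap functions can be written down directly from Equations 21.13 and 21.14. The following Python sketch assumes bit strings given as lists of 0/1 values; the function names are illustrative and not part of the original definitions.

```python
def hamming(a, b):
    """Number of positions in which two equally long sequences differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


def trap(x, optimum, b, r):
    """Trap function of Equation 21.13 (to be maximized)."""
    d = hamming(x, optimum) / len(x)  # relative distance to the global optimum
    if d < b:
        return 1.0 - d / b            # basin of the global optimum
    return r * (d - b) / (1.0 - b)    # deceptive hill leading away from it


def ackley_trap(x):
    """Ackley's trap function of Equation 21.14 (to be maximized)."""
    n = len(x)
    z = (3 * n) // 4
    u = sum(x)                        # number of ones
    if u < z:
        return 8 * n * (z - u) / z
    return 10 * n * (u - z) / (n - z)
```

For n = 8, Ackley's function yields 64 for the all-zero string (the local optimum) and 80 for the all-one string (the global optimum), with a valley of fitness 0 at u(x) = 6 in between.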



21.2.4 The Royal Road 

The Royal Road functions developed by Mitchell et al. [1432] and presented first at the
Fifth International Conference on Genetic Algorithms in July 1993 are a set of special
fitness landscapes for genetic algorithms [1067, 1432, 731, 1682, 2098]. Their problem space
X and search space G are fixed-length bit strings. The Royal Road functions are closely
related to the Schema Theorem⁵ and the Building Block Hypothesis⁶ and were used to
study the way in which highly fit schemas are discovered. They therefore define a set of
schemas S = {s_1, s_2, ..., s_n} and an objective function (here referred to as fitness function),
subject to maximization, as

f(x) = \sum_{\forall s \in S} c(s) \sigma(s, x)    (21.15)

where x ∈ X is a bit string, c(s) is a value assigned to the schema s and σ(s, x) is defined as

\sigma(s, x) = \begin{cases} 1 & \text{if } x \text{ is an instance of } s \\ 0 & \text{otherwise} \end{cases}    (21.16)



In the original version, c(s) is the order of the schema s, i.e., c(s) = order(s), and S is
specified as follows (where * stands for the don't care symbol as usual).



Listing 21.1: An example Royal Road function. 



5 See Section 3.6 on page 150 for more details. 

6 The Building Block Hypothesis is elaborated on in Section 3.6.5 on page 152.

By the way, from this example, we can easily see that a fraction of all mutation and
crossover operations applied to most of the solution candidates will fall into the don't care
areas. Such modifications will not yield any fitness change and therefore are neutral.
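
Equations 21.15 and 21.16 can be evaluated with very little code. The sketch below is hypothetical in one important respect: the schema set is generated by the helper build_schemas, which assumes aligned blocks of ones of size 8, 16, 32, and 64 on 64-bit strings. This layout is an assumption made for illustration and is not a transcription of Listing 21.1.

```python
def build_schemas(n=64, block=8):
    """Hypothetical schema set: aligned blocks of ones of size block, 2*block, ..., n."""
    schemas = []
    size = block
    while size <= n:
        for start in range(0, n, size):
            schemas.append("*" * start + "1" * size + "*" * (n - start - size))
        size *= 2
    return schemas


def is_instance(schema, x):
    """sigma(s, x): x matches s at every position that is not a don't care."""
    return all(c == "*" or c == bit for c, bit in zip(schema, x))


def order(schema):
    """Number of defined (non-*) positions of a schema."""
    return sum(1 for c in schema if c != "*")


def royal_road(x, schemas):
    """f(x) = sum of c(s) * sigma(s, x) with c(s) = order(s)."""
    return sum(order(s) for s in schemas if is_instance(s, x))
```

Flipping a bit outside of all matched schemas (for example the last bit of a string consisting of eight ones followed by zeros) leaves the fitness unchanged, which is the neutrality mentioned above.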

The Royal Road functions provide certain, predefined stepping stones (i.e., building
blocks) which (theoretically) can be combined by the genetic algorithm to successively create
schemas of higher fitness and order. Mitchell et al. [1432] performed several tests with their
Royal Road functions. These tests revealed or confirmed that

1. Crossover is a useful reproduction operation in this scenario. Genetic algorithms which 
apply this operation clearly outperform hill climbing approaches solely based on muta- 
tion. 

2. In the spirit of the Building Block Hypothesis, one would expect that the intermediate
steps (for instance order 32 and 16) of the Royal Road functions would help the genetic
algorithm to reach the optimum. The experiments of Mitchell et al. [1432] showed the
exact opposite: leaving them out speeds up the evolution significantly. The reason is that
the fitness difference between the intermediate steps and the low-order schemas is high
enough that the first instance of them will lead the GA to converge to it and wipe out
the low-order schemas. The other parts of this intermediate solution play no role and
may allow many zeros to hitchhike along.

Especially this last point gives us another insight on how we should construct genomes:
the fitness of combinations of good low-order schemas should not be too high, so that other
good low-order schemas do not go extinct when they emerge. Otherwise, the phenomenon of
domino convergence researched by Rudnick [1773] and outlined in Section 1.4.2 and Section 21.2.5
may occur.



Variable-Length Representation 

The original Royal Road problems can be defined for binary string genomes of any given 
length n, as long as n is fixed. A Royal Road benchmark for variable-length genomes has 
been defined by Defoin Platel et al. [548]. 

The problem space X_Σ of the VLR (variable-length representation) Royal Road problem
is based on an alphabet Σ with N = |Σ| letters. The fitness of an individual x ∈ X_Σ is
determined by whether or not consecutive building blocks of the length b of the letters l ∈ Σ
are present. This presence can be defined as

B_b(x, l) = \begin{cases} 1 & \text{if } \exists i : (0 \leq i \leq len(x) - b) \wedge (x[i+j] = l \;\; \forall j : 0 \leq j \leq b - 1) \\ 0 & \text{otherwise} \end{cases}    (21.17)

1. Where b ≥ 1 is the length of the building blocks,

2. Σ is the alphabet with N = |Σ| letters,

3. l is a letter in Σ,

4. x ∈ X_Σ is a solution candidate, and

5. x[k] is the k-th locus of x.

B_b(x, l) is 1 if a building block, an uninterrupted sequence of the letter l, of at least length
b, is present in x. Of course, if len(x) < b this cannot be the case and B_b(x, l) will be zero.

We can now define the functional objective function f_{Σb} : X_Σ ↦ [0, 1] which is subject
to maximization as

f_{\Sigma b}(x) = \frac{1}{N} \sum_{i=0}^{N-1} B_b(x, \Sigma[i])    (21.18)

An optimal individual x* solving the VLR Royal Road problem is thus a string that
includes building blocks of length b for all letters l ∈ Σ. Notice that the position of these
blocks plays no role. The set X*_Σ of all such optima with f_{Σb}(x*) = 1 is then

X^*_\Sigma = \{ x^* \in X_\Sigma : B_b(x^*, l) = 1 \;\; \forall l \in \Sigma \}    (21.19)



Such an optimum x* for b = 3 and Σ = {A, T, G, C} is

x* = AAAGTGGGTAATTTTCCCTCCC    (21.20)

The relevant building blocks of x* are AAA, GGG, TTT, and CCC. As can easily be seen, their
location plays no role, only their presence is important. Furthermore, multiple occurrences of
building blocks (like the second CCC) do not contribute to the fitness. The fitness landscape
has been designed in a way ensuring that fitness degeneration by crossover can only occur
if the crossover points are located inside building blocks and not by block translocation or
concatenation. In other words, there is no inter-block epistasis.
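
As a small illustration of Equations 21.17 and 21.18, the VLR Royal Road fitness can be computed with a substring test. The sketch assumes that solution candidates are ordinary Python strings over the alphabet; the function names are chosen for this example only.

```python
def has_block(x, letter, b):
    """B_b(x, l): 1 if x contains b consecutive copies of the letter, else 0."""
    return 1 if letter * b in x else 0


def vlr_fitness(x, alphabet, b):
    """f(x): fraction of letters for which a building block is present."""
    return sum(has_block(x, letter, b) for letter in alphabet) / len(alphabet)
```

Applied to the example string of Equation 21.20 with b = 3, the function returns 1.0, whereas a string such as AAAG contains only one of the four blocks and scores 0.25.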

Epistatic Road 

Defoin Platel et al. [549] combined their previous work on the VLR Royal Road with
Kauffman's NK landscapes and introduced the Epistatic Road. The original NK landscape works
on binary representation of the fixed length N. To each locus i in the representation, one
fitness function f_i is assigned denoting its contribution to the overall fitness. f_i however is
not exclusively computed using the allele at the i-th locus but also depends on the alleles of
K other loci, its neighbors.

The VLR Royal Road uses a genome based on the alphabet Σ with N = len(Σ) letters.
It defines the function B_b(x, l) which returns 1 if a building block of length b containing
only the character l is present in x and 0 otherwise. Because of the fixed size of the alphabet
Σ, there exist exactly N such functions. Hence, the variable-length representation can be
translated to a fixed-length, binary one by simply concatenating them:

B_b(x, \Sigma[0]) \, B_b(x, \Sigma[1]) \, \ldots \, B_b(x, \Sigma[N-1])    (21.21)

Now we can define an NK landscape for the Epistatic Road by substituting the B_b(x, l)
into Equation 21.4 on page 330:

F_{N,K,b}(x) = \frac{1}{N} \sum_{i=0}^{N-1} f_i(B_b(x, \Sigma[i]), B_b(x, \Sigma[i_1]), \ldots, B_b(x, \Sigma[i_K]))    (21.22)

The only thing left is to ensure that the end of the road, i.e., the presence of all N
building blocks, also is the optimum of F_{N,K,b}. This is done by exhaustively searching the
space 𝔹^N and defining the f_i in a way that B_b(x, l) = 1 ∀l ∈ Σ ⇒ F_{N,K,b}(x) = 1.

Royal Trees 

An analogue of the Royal Road for Genetic Programming has been specified by Punch et al. 
[1678]. This Royal Tree problem specifies a series of functions A, B, C, ... with increasing 
arity, i. e., A has one argument, B has two arguments, C has three, and so on. Additionally, 
a set of terminal nodes x, y, z is defined. 

For the first three levels, the perfect trees are shown in Figure 21.3. An optimal A-level tree
consists of an A node with an x leaf attached to it. The perfect level-B tree has a B as
root with two perfect level-A trees as children. A node labeled with C having three children
which all are optimal B-level trees is the optimum at C-level, and so on.

The objective function, subject to maximization, is computed recursively. The raw fitness
of a node is the weighted sum of the fitness of its children. If the child is a perfect tree at the
appropriate level, a perfect C tree beneath a D-node, for instance, its fitness is multiplied
with the constant FullBonus, which normally has the value 2. If the child is not a perfect
tree, but has the correct root, the weight is PartialBonus (usually 1). If it is otherwise
incorrect, its fitness is multiplied with Penalty, which is 1/3 per default. If the whole tree is
a perfect tree, its raw fitness is finally multiplied with CompleteBonus which normally is
also 2. The value of an x leaf is 1.

Figure 21.3: The perfect Royal Trees. (Fig. 21.3.a: Perfect A-level; Fig. 21.3.b: Perfect
B-level; Fig. 21.3.c: Perfect C-level)

From Punch et al. [1678], we can furthermore borrow three examples for this fitness
assignment and outline them in Figure 21.4. A tree which represents a perfect A level has
the score of CompleteBonus * FullBonus * 1 = 2 * 2 * 1 = 4. A complete and perfect tree at
level B receives CompleteBonus * (FullBonus * 4 + FullBonus * 4) = 2 * (2 * 4 + 2 * 4) = 32. At
level C, this makes CompleteBonus * (FullBonus * 32 + FullBonus * 32 + FullBonus * 32) =
2 * (2 * 32 + 2 * 32 + 2 * 32) = 384.




Figure 21.4: Example fitness evaluation of Royal Trees. (Fig. 21.4.a: 2(2*32 + 2*32 + 2*32) = 384;
Fig. 21.4.b: 2*32 + 2*32 + (1/3)*1 = 128 1/3; Fig. 21.4.c: 2*32 + (1/3)*1 + (1/3)*1 = 64 2/3)
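
The recursive scoring rule can be sketched as follows. The tree encoding (a pair of label and child tuple), the helper names, and the restriction to the single terminal x are assumptions made for this illustration; exact fractions are used so that the Penalty of 1/3 does not introduce rounding errors.

```python
from fractions import Fraction

FULL_BONUS, PARTIAL_BONUS, PENALTY, COMPLETE_BONUS = 2, 1, Fraction(1, 3), 2
LEVELS = "xABCDE"  # LEVELS[k] is the root label of a level-k tree and has k children


def node(label, *children):
    """Build a tree node as a (label, children) pair."""
    return (label, children)


def is_perfect(tree):
    """True if the tree is the perfect tree for the level of its root label."""
    label, children = tree
    level = LEVELS.find(label)
    if level < 0 or len(children) != level:
        return False
    return all(c[0] == LEVELS[level - 1] and is_perfect(c) for c in children)


def score(tree):
    """Royal Tree fitness: weighted sum over the children, to be maximized."""
    label, children = tree
    if not children:
        return Fraction(1)  # value of a leaf (stated in the text for x)
    level = LEVELS.find(label)
    expected = LEVELS[level - 1] if level > 0 else None
    raw = Fraction(0)
    for child in children:
        if child[0] == expected:
            weight = FULL_BONUS if is_perfect(child) else PARTIAL_BONUS
        else:
            weight = PENALTY
        raw += weight * score(child)
    return raw * COMPLETE_BONUS if is_perfect(tree) else raw
```

Under these assumptions, the perfect trees of levels A, B, and C score 4, 32, and 384, and a C node with two perfect B children and one x leaf scores 128 1/3.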



Other Derived Problems 

Storch and Wegener [1968, 1969, 1970] used their Real Royal Road for showing that there
exist problems where crossover helps to improve the performance of evolutionary algorithms.
Naudts et al. [1505] have contributed generalized Royal Road functions in order
to study epistasis.

21.2.5 OneMax and BinInt

The OneMax and BinInt problems are two very simple model problems for measuring the
convergence of genetic algorithms.

The OneMax Problem

The task in the OneMax (or BitCount) problem is to find a binary string of length n
consisting of all ones. The search and problem space are both the fixed-length bit strings
G = X = 𝔹^n. Each gene (bit) has two alleles 0 and 1 which also contribute exactly this value
to the total fitness, i.e.,

f(x) = \sum_{i=0}^{n-1} x[i], \quad \forall x \in X    (21.23)

For the OneMax problem, an extensive body of research has been provided by Ackley
[12], Mühlenbein and Schlierkamp-Voosen [1481], Thierens and Goldberg [2035], Miller and
Goldberg [1416], Bäck [97], Blickle and Thiele [230], and Wilson and Kaur [2230].

The BinInt Problem

The BinInt problem devised by Rudnick [1773] also uses the bit strings of the length n
as search and problem space (G = X = 𝔹^n). It is something like a perverted version of the
OneMax problem, with the objective function defined as

f(x) = \sum_{i=0}^{n-1} 2^{n-i-1} x[i], \quad x[i] \in \{0, 1\} \;\; \forall i \in [0..n-1]    (21.24)

Since the bit at index i has a higher contribution to the fitness than all other bits at higher
indices together, the comparison between two solution candidates x_1 and x_2 is won by the
lexicographically bigger one. Thierens et al. [2036] give the example x_1 = (1,1,1,1,0,0,0,0)
and x_2 = (1,1,0,0,1,1,0,0), where the first deviating bit (at index 2) fully
determines the outcome of the comparison of the two.
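
Both objective functions are short enough to be written out directly. The following Python sketch assumes bit strings stored as sequences of 0/1 integers with index 0 as the most salient bit, as in Equation 21.24.

```python
def onemax(x):
    """Equation 21.23: number of ones in the bit string."""
    return sum(x)


def binint(x):
    """Equation 21.24: the bit string read as a binary number, index 0 first."""
    n = len(x)
    return sum(bit << (n - i - 1) for i, bit in enumerate(x))
```

For the example of Thierens et al., both strings have the same OneMax value of 4, but BinInt yields 240 for x_1 and 204 for x_2, so x_1 wins because of the bit at index 2.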

We can expect that the bits with high contribution (high salience) will converge quickly
whereas the other genes with lower salience are only pressured by selection when all others
have already fully converged. Rudnick [1773] called this sequential convergence
phenomenon domino convergence due to its resemblance to a row of falling domino stones
[2036] (see Section 1.4.2). Generally, he showed that first, the highly salient genes converge
(i.e., take on the correct values in the majority of the population). Then, step by step, the
building blocks of lower significance can converge, too. Another result of Rudnick's work
is that mutation may stall the convergence because it can disturb the highly significant
genes, which then counters the effect of the selection pressure on the less salient ones. Then,
it becomes much less likely that the majority of the population will have the best alleles
in these genes. This somehow dovetails with the idea of error thresholds from theoretical
biology [625, 1552] which we have mentioned in Section 1.4.3. It also explains some of the
experimental results obtained with the Royal Road problem from Section 21.2.4. The BinInt
problem was used in the studies of Sastry and Goldberg [1810, 1811].

One of the maybe most important conclusions from the behavior of GAs applied to the
BinInt problem is that applying a genetic algorithm to solve a numerical problem (X ⊆ ℝ^n)
whilst encoding the solution candidates binary (G ⊆ 𝔹^n) in a straightforward manner will
likely produce suboptimal solutions. Schraudolph and Belew [1836], for instance, recognized
this problem and suggested a Dynamic Parameter Encoding (DPE) where the values of the genes
of the bit string are readjusted over time: Initially, optimization takes place on a rather
coarse-grained scale and after the optimum on this scale is approximated, the focus is shifted to a
finer interval and the genotypes are re-encoded to fit into this interval. In their experiments,
this method works better than the direct encoding.

21.2.6 Long Path Problems 

The long path problems have been designed by Horn et al. [958] in order to construct a
unimodal, non-deceptive problem without noise which hill climbing algorithms still can only
solve in exponential time. The idea is to wind a path with increasing fitness through the
search space so that any two adjacent points on the path are no further away than one
search step and any two points not adjacent on the path are at least two search steps apart.
All points which are not on the path should guide the search to its origin.

The problem space and search space in their concrete realization is the space of the
binary strings G = X = 𝔹^l of the fixed, odd length l. The objective function f_lp(x) is
subject to maximization. It is furthermore assumed that the search operations in hill climbing
algorithms alter at most one bit per search step, from which it follows that two adjacent
points on the path have a Hamming distance of one and two non-adjacent points differ in
at least two bits.

The simplest instance of the long path problems that Horn et al. [958] define is the
Root2path P_l. Paths of this type are constructed by iteratively increasing the search space
dimension. Starting with P_1 = (0, 1), the path P_{l+2} is constructed from two copies P_l^a = P_l^b
of the path P_l as follows. First, we prepend 00 to all elements of the path P_l^a and 11 to
all elements of the path P_l^b. For l = 1 this makes P_1^a = (000, 001) and P_1^b = (110, 111).
Obviously, two adjacent elements on P_l^a or on P_l^b still have a Hamming distance of one whereas
each element from P_l^a differs in at least two bits from each element on P_l^b. Then, a bridge
element B_l is created that equals the last element of P_l^a, but has 01 as the first two bits,
i.e., B_1 = 011. Now the sequence of the elements in P_l^b is reversed and P_l^a, B_l, and the
reversed P_l^b are concatenated. Hence, P_3 = (000, 001, 011, 111, 110). Due to this recursive
structure of the path construction, the path length increases exponentially with l (for odd l):

len(P_{l+2}) = 2 \cdot len(P_l) + 1    (21.25)
len(P_l) = 3 \cdot 2^{(l-1)/2} - 1    (21.26)
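
The recursive construction can be reproduced in a few lines. The sketch below represents path elements as Python strings and follows the three steps described above (prefix 00, bridge with prefix 01, reversed copy with prefix 11); the function names are chosen for this illustration.

```python
def root2path(l):
    """Construct the Root2path P_l (l odd) as a list of bit strings."""
    path = ["0", "1"]  # P_1
    for _ in range((l - 1) // 2):
        first = ["00" + p for p in path]             # copy a, prefixed with 00
        bridge = "01" + path[-1]                     # bridge element
        second = ["11" + p for p in reversed(path)]  # reversed copy b, prefixed with 11
        path = first + [bridge] + second
    return path


def hamming(a, b):
    """Number of positions in which two equally long strings differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)
```

For l = 3 this yields (000, 001, 011, 111, 110), and the lengths 2, 5, 11, 23, ... agree with Equation 21.26.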

The basic fitness of a solution candidate is 0 if it is not on the path and its (zero-based)
position on the path plus one if it is part of the path. The total number of points in the space
𝔹^l is 2^l and thus, the fraction occupied by the path is approximately 3 * 2^(−(l+1)/2), i.e.,
decreases exponentially. In order to avoid that the long path problem becomes a
needle-in-a-haystack problem⁷, Horn et al. [958] assign a fitness that leads the search algorithm
to the path's origin to all off-path points x_o. Since the first point of the path is always the
string 00...0 containing only zeros, subtracting the number of ones from l, i.e.,
f_lp(x_o) = l − countOccurences(1, x_o), is the method of choice. To all points on the path, l is
added to the basic fitness, making them superior to all other solution candidates.

Some examples for the construction of Root2paths can be found in Table 21.2 and the
path for l = 3 is illustrated in Figure 21.5. In Algorithm 21.1 we try to outline how the
objective value of a solution candidate x can be computed online. Here, please notice two
things: First, this algorithm deviates from the one introduced by Horn et al. [958]: we
tried to resolve the tail recursion and also added some minor changes. Another algorithm
for determining f_lp is given by Rudolph [1774]. The second thing to realize is that for small
l, we would not use the algorithm during the individual evaluation but rather a lookup
table. Each solution candidate could directly be used as index for this table which contains
the objective values. For l = 20, for example, a table with entries of the size of 4B would
consume 4MiB which is acceptable on today's computers.
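
The fitness assignment described above can be sketched without building the whole path by peeling off two leading bits at a time. This is an illustrative variant written for this section, not a transcription of Algorithm 21.1 or of the procedure of Horn et al.; it follows the textual definition (position plus one plus l on the path, l minus the number of ones elsewhere), and root2path is repeated only to keep the example self-contained.

```python
def root2path(l):
    """Root2path P_l (l odd) as a list of bit strings, used here as a reference."""
    path = ["0", "1"]
    for _ in range((l - 1) // 2):
        path = ["00" + p for p in path] + ["01" + path[-1]] + ["11" + p for p in reversed(path)]
    return path


def path_position(x):
    """Zero-based position of the bit string x on the Root2path, or None if off the path."""
    pos, sign, i, s = 0, 1, 0, len(x)  # s: number of bits not yet processed
    while s > 1:
        prefix, rest = x[i:i + 2], x[i + 2:]
        if prefix == "11":    # second half: positions are counted backwards
            pos += sign * (3 * 2 ** ((s - 1) // 2) - 2)
            sign = -sign
        elif prefix == "01":  # only the bridge element starts with 01
            last = "1" if s == 3 else "11" + "0" * (s - 4)
            if rest != last:
                return None
            return pos + sign * (3 * 2 ** ((s - 3) // 2) - 1)
        elif prefix == "10":  # no path element starts with 10
            return None
        i, s = i + 2, s - 2
    return pos + sign * int(x[i])


def long_path_fitness(x):
    """Objective value of x: on-path points always beat off-path points."""
    pos = path_position(x)
    if pos is None:
        return len(x) - x.count("1")  # guides the search to 00...0
    return pos + 1 + len(x)
```

For small l, precomputing a dictionary from root2path(l) gives the lookup table mentioned above; the direct computation avoids the exponentially growing table for larger l.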

The experiments of Horn et al. [958] showed that hill climbing methods that only
concentrate on sampling the neighborhood of the currently known best solution perform very
poorly on long path problems whereas genetic algorithms which combine different solution
candidates via crossover easily find the correct solution. Rudolph [1774] shows that it does
so in polynomial expected time. He also extends this idea to long k-paths in [1775]. Droste
et al. [598] and Garnier and Kallel [774] analyze this path and find that also (1+1)-EAs can
have exponential expected runtime on such unimodal functions.

It should be mentioned that the Root2paths constructed according to the method described
in this section do not have the maximum length possible for long paths. Horn et al. [958]
also introduce Fibonacci paths which are longer than the Root2paths. The problem of finding
maximum length paths in an l-dimensional hypercube is known as the snake-in-the-box⁸
problem [1104, 483] which was first described by Kautz [1104] in the late 1950s. It is a very
hard problem suffering from combinatorial explosion and currently, maximum snake lengths
are only known for small values of l.

7 See Section 1.4.5 for more information on needle-in-a-haystack problems.

Figure 21.5: The root2path for l = 3.

P_1 = (0, 1)

P_3 = (000, 001, [011], 111, 110)

P_5 = (00000, 00001, 00011, 00111, 00110, [01110], 11110, 11111, 11011, 11001, 11000)

P_7 = (0000000, 0000001, 0000011, 0000111, 0000110, 0001110, 0011110, 0011111, 0011011,
0011001, 0011000, [0111000], 1111000, 1111001, 1111011, 1111111, 1111110, 1101110,
1100110, 1100111, 1100011, 1100001, 1100000)

P_9 = (000000000, 000000001, 000000011, 000000111, 000000110, 000001110, 000011110,
000011111, 000011011, 000011001, 000011000, 000111000, 001111000, 001111001,
001111011, 001111111, 001111110, 001101110, 001100110, 001100111, 001100011,
001100001, 001100000, [011100000], 111100000, 111100001, 111100011, 111100111,
111100110, 111101110, 111111110, 111111111, 111111011, 111111001, 111111000,
110111000, 110011000, 110011001, 110011011, 110011111, 110011110, 110001110,
110000110, 110000111, 110000011, 110000001, 110000000)

P_11 = (00000000000, 00000000001, 00000000011, 00000000111, 00000000110, 00000001110,
00000011110, 00000011111, 00000011011, 00000011001, 00000011000, 00000111000,
00001111000, 00001111001, 00001111011, 00001111111, 00001111110, 00001101110,
00001100110, 00001100111, 00001100011, 00001100001, 00001100000, 00011100000,
00111100000, 00111100001, 00111100011, 00111100111, 00111100110, 00111101110,
00111111110, 00111111111, 00111111011, 00111111001, 00111111000, 00110111000,
00110011000, 00110011001, 00110011011, 00110011111, 00110011110, 00110001110,
00110000110, 00110000111, 00110000011, 00110000001, 00110000000, [01110000000],
11110000000, 11110000001, 11110000011, 11110000111, 11110000110, 11110001110,
11110011110, 11110011111, 11110011011, 11110011001, 11110011000, 11110111000,
11111111000, 11111111001, 11111111011, 11111111111, 11111111110, 11111101110,
11111100110, 11111100111, 11111100011, 11111100001, 11111100000, 11011100000,
11001100000, 11001100001, 11001100011, 11001100111, 11001100110, 11001101110,
11001111110, 11001111111, 11001111011, 11001111001, 11001111000, 11000111000,
11000011000, 11000011001, 11000011011, 11000011111, 11000011110, 11000001110,
11000000110, 11000000111, 11000000011, 11000000001, 11000000000)

Table 21.2: Some long Root2paths for l from 1 to 11 with the bridge elements in brackets.

Algorithm 21.1: r ← f_lp(x)

Input: x: the solution candidate with an odd length
Data: s: the current position in x
Data: sign: the sign of the next position
Data: isOnPath: true if and only if x is on the path
Output: r: the objective value

begin
  sign ← 1
  s ← len(x) − 1
  r ← 0
  isOnPath ← true
  while (s ≥ 0) ∧ isOnPath do
    if s = 0 then
      if x[0] = 1 then r ← r + sign
    else
      sub ← x[s] x[s−1]   // the two leading bits not yet processed
      if sub = 11 then
        r ← r + sign * (3 * 2^(s/2) − 2)
        sign ← −sign
      else
        if sub ≠ 00 then
          if (x[s] = 0) ∧ (x[s−1] = 1) ∧ (x[s−2] = 1) then
            if (s = 2) ∨ [(x[s−3] = 1) ∧ (countOccurences(1, subList(x, 0, s − 3)) = 0)] then
              r ← r + sign * (3 * 2^(s/2 − 1) − 1)
              s ← 0   // bridge element fully checked, leave the loop
            else isOnPath ← false
          else isOnPath ← false
    s ← s − 2
  if isOnPath then r ← r + len(x)
  else r ← len(x) − countOccurences(1, x) − 1
end

21.2.7 Tunable Model for Problematic Phenomena 

What is a good model problem? Which model fits best to our purposes? These questions
should be asked whenever we apply a benchmark, whenever we want to use something for
testing the ability of a global optimization approach. The mathematical functions
introduced in Section 21.1, for instance, are good for testing special mathematical reproduction
operations like those used in Evolution Strategies and for testing the capability of an
evolutionary algorithm for estimating the Pareto frontier in multi-objective optimization.
Kauffman's NK fitness landscape (discussed in Section 21.2.1) was intended to be a tool for
exploring the relation of ruggedness and epistasis in fitness landscapes but can prove very
useful for finding out how capable a global optimization algorithm is of dealing with problems
exhibiting these phenomena. In Section 21.2.4, we outlined the Royal Road functions, which
were used to investigate the ability of genetic algorithms to combine different useful formae
and to test the Building Block Hypothesis. The Artificial Ant (Section 21.3.1) and the GCD
problem from Section 21.3.2 are tests for the ability of Genetic Programming to learn
algorithms.

8 http://en.wikipedia.org/wiki/Snake-in-the-box [accessed 2008-08-13]

All these benchmarks and toy problems focus on specific aspects of global optimization 
and will exhibit different degrees of the problematic properties of optimization problems to 
which we had devoted Section 1.4: 

1. premature convergence and multimodality (Section 1.4.2), 

2. ruggedness (Section 1.4.3), 

3. deceptiveness (Section 1.4.4), 

4. neutrality and redundancy (Section 1.4.5), 

5. overfitting and oversimplification (Section 1.4.8), and

6. dynamically changing objective functions (Section 1.4.9). 

With the exception of the NK fitness landscape, it remains unclear to which degrees these
phenomena occur in the test problem. How much intrinsic epistasis does the Artificial Ant
or the GCD problem exhibit? What is the quantity of neutrality inherent in Royal Road for
variable-length representations? Are the mathematical test functions rugged and, if so, to
which degree? All the problems are useful test instances for global optimization. They have
not been designed to give us answers to questions like: Which fitness assignment process
can be useful when an optimization problem exhibits weak causality and thus has a rugged
fitness landscape? How does a certain selection algorithm influence the ability of a genetic
algorithm to deal with neutrality? Only Kauffman's NK landscape provides such answers,
but only for epistasis. By fine-tuning its N and K parameters, we can generate problems with
different degrees of epistasis. Applying a genetic algorithm to these problems then allows us
to draw conclusions on its expected performance when being fed with high or low epistatic
real-world problems.

In this section, a new model problem is defined that exhibits ruggedness, epistasis,
neutrality, multi-objectivity, overfitting, and oversimplification features in a controllable manner
[2185, 1533]. Each of them is introduced as a distinct filter component which can separately
be activated, deactivated, and fine-tuned. This model provides a test bed for optimization
algorithms and their configuration settings. Based on a
rough estimation of the structure of the fitness landscape of a given problem, tests can be
run very fast using the model as a benchmark for the settings of an optimization algorithm.
Thus, we could, for instance, determine a priori whether increasing the population size of
an evolutionary algorithm over an approximated limit is likely to provide a gain.

With it, we also can evaluate the behavior of an optimization method in the presence of 
various problematic aspects, like epistasis or neutrality. This way, strengths and weaknesses 
of different evolutionary approaches could be explored in a systematic manner. Additionally, 
it is also well suited for theoretical analysis because of its simplicity. The layers of the model, 
sketched using an example in Figure 21.6, are specified in the following. 

Model Definition 

The basic optimization task in this model is to find a binary string x* of a predefined length
n = len(x*) consisting of alternating zeros and ones in the space of all possible binary strings
X = 𝔹*. The tuning parameter for the problem size is n ∈ ℕ.

x* = 0101010101010...01    (21.27)

Overfitting and Oversimplification 

Searching this optimal string could be done by comparing each genotype g with x*. For this
purpose, we would use the Hamming distance⁹ [882] dist_Ham(a, b), which defines the difference
between two binary strings a and b of equal length as the number of bits in which they differ
(see Equation 29.10).

9 Definition 29.6 on page 537 includes the specification of the Hamming distance.

Figure 21.6: An example for the fitness landscape model.

Instead of doing this directly, we test the solution candidate against tc training samples
T_1, T_2, ..., T_tc. These samples are modified versions of the perfect string x*.

As outlined in Section 1.4.8 on page 72, we can distinguish between overfitting and
oversimplification. The latter is often caused by incompleteness of the training cases and the
former can originate from noise in the training data. Both forms can be expressed in terms
of our model by the objective function f_{ε,o,tc} (based on a slightly extended version
dist*_Ham of the Hamming distance) which is subject to minimization.

dist^{*}_{Ham}(a, b) = |\{ \forall i : (a[i] \neq b[i]) \wedge (b[i] \neq *) \wedge (0 \leq i < |a|) \}|    (21.28)

f_{\varepsilon,o,tc}(x) = \sum_{i=1}^{tc} dist^{*}_{Ham}(x, T_i), \quad f_{\varepsilon,o,tc}(x) \in [0, \hat{f}] \;\; \forall x \in X    (21.29)

In the case of oversimplification, the perfect solution x* will always reach a perfect score
in all training cases. There may be incorrect solutions reaching this value in some cases too,
because some of the facets of the problem are hidden. We take this into consideration by
placing o don't care symbols (*) uniformly distributed into the training cases. The values of
the solution candidates at their loci have no influence on the fitness.

When overfitting is enabled, the perfect solution will not reach the optimal score in any
training case because of the noise present. Incorrect solutions may score better in some
cases and even outperform the real solution if the noise level is high. Noise is introduced
in the training cases by toggling ε of the remaining n − o bits, again following a uniform
distribution. An optimization algorithm can find a correct solution only if there are more
training samples with correctly defined values for each locus than with wrong or don't care
values.

The optimal objective value is zero and the maximum f̂ of f_{ε,o,tc} is limited by the upper
boundary f̂ ≤ (n − o)tc. Its exact value depends on the training cases. For each bit index i,
we have to take into account whether a zero or a one in the phenotype would create larger
errors:

count(i, val) = |\{ j \in 1..tc : T_j[i] = val \}|    (21.30)

e(i) = \begin{cases} count(i, 1) & \text{if } count(i, 1) \geq count(i, 0) \\ count(i, 0) & \text{otherwise} \end{cases}    (21.31)

\hat{f} = \sum_{i=0}^{n-1} e(i)    (21.32)
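
The objective function of Equation 21.29 and its largest possible value can be sketched as follows. The example assumes that the solution candidate and the training cases are Python strings of equal length over 0, 1, and * (don't care); the function names are chosen for this illustration, and the worst case is computed per locus as described above.

```python
def dist_star(a, b):
    """Extended Hamming distance: don't care symbols (*) in b never count."""
    return sum(1 for ai, bi in zip(a, b) if bi != "*" and ai != bi)


def objective(x, training_cases):
    """Sum of the extended distances to all training cases (to be minimized)."""
    return sum(dist_star(x, t) for t in training_cases)


def max_objective(training_cases):
    """Largest possible objective value for a given set of training cases."""
    total = 0
    for i in range(len(training_cases[0])):
        zeros = sum(1 for t in training_cases if t[i] == "0")
        ones = sum(1 for t in training_cases if t[i] == "1")
        total += max(zeros, ones)  # the worse of the two possible bit values
    return total
```

With the training cases 01*1, 1101, and 0100, the candidate 0101 has an objective value of 2, while the worst possible candidate reaches 9.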

Neutrality 

We can create a well-defined amount of neutrality during the genotype-phenotype mapping
by applying a transformation u_μ that shortens the solution candidates by a factor μ. The i-th
bit in u_μ(g) is defined as 0 if and only if the majority of the μ bits starting at locus i * μ in g
is also 0, and as 1 otherwise. The default value 1 set in draw situations has (on average) no
effect on the fitness since the target solution x* is defined as a sequence of alternating zeros
and ones. If the length of a genotype g is not a multiple of μ, the remaining len(g) mod μ
bits are ignored. The tunable parameter for the neutrality in our model is μ. If μ is set to
1, no additional neutrality is modeled.
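
The neutrality transformation amounts to a block-wise majority vote. A minimal sketch, assuming genotypes are lists of 0/1 integers and using a function name chosen for this example:

```python
def neutrality(g, mu):
    """Shorten g by the factor mu: each block of mu bits becomes its majority bit."""
    out = []
    for start in range(0, len(g) - len(g) % mu, mu):  # trailing bits are ignored
        block = g[start:start + mu]
        zeros = block.count(0)
        out.append(0 if zeros > mu - zeros else 1)    # draws default to 1
    return out
```

With mu = 3, the genotype 00111010 is mapped to 01: the blocks 001 and 110 vote for 0 and 1, and the two remaining bits are dropped.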

Epistasis 

Epistasis in general means that a slight change in one gene of a genotype influences some
other genes. We can introduce epistasis in our model as part of the genotype mapping and
apply it after the neutrality transformation. We therefore define a bijective function e_η that
translates a binary string z of length η to a binary string e_η(z) of the same length. Assume
we have two binary strings z_1 and z_2 which differ only in one single locus, i.e., their Hamming
distance is one. e_η introduces epistasis by exhibiting the following property:

dist_{Ham}(z_1, z_2) = 1 \Rightarrow dist_{Ham}(e_\eta(z_1), e_\eta(z_2)) \ge \eta - 1 \quad \forall z_1, z_2 \in \mathbb{B}^\eta    (21.33)

The meaning of Equation 21.33 is that a change of one bit in a genotype g leads to the
change of at least η − 1 bits in the corresponding mapping e_η(g). This, as well as the demand
for bijectivity, is provided if we define e_η as follows:



e_\eta(z)[i] = \begin{cases} \bigoplus_{\forall j : 0 \le j < \eta \,\wedge\, j \ne (i-1) \bmod \eta} z[j] & \text{if } 0 \le z < 2^{\eta-1}, \text{ i.e., } z[\eta-1] = 0 \\ \overline{e_\eta(z - 2^{\eta-1})[i]} & \text{otherwise} \end{cases}    (21.34)

In other words, for all strings z ∈ B^η which have the most significant bit (MSB) not
set, the e_η transformation is performed bitwise. The i-th bit in e_η(z) equals the exclusive or
combination of all but one bit in z. Hence, each bit in z influences the value of η − 1 bits in
e_η(z). For all z with 1 in the MSB, e_η(z) is simply set to the negated e_η transformation of z
with the MSB cleared (the value of the MSB is 2^(η−1)). This division in e_η is needed in order
to ensure its bijectiveness. This and the compliance with Equation 21.33 can be shown with
a rather lengthy proof omitted here.
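The mapping can be sketched as follows. We assume here that z is stored as a non-negative integer whose bit 0 is the least significant bit and that output bit i leaves out the input bit at index (i − 1) mod η; with this reading, the sketch reproduces the η = 4 values listed in Figure 21.7. The function name epistasis is hypothetical.

```python
def epistasis(z: int, eta: int) -> int:
    """Map the eta-bit integer z to e_eta(z); bit 0 is the least significant bit."""
    if eta < 1 or not 0 <= z < (1 << eta):
        raise ValueError("z must fit into eta bits")
    msb = 1 << (eta - 1)
    if z & msb:
        # MSB set: negate (within eta bits) the mapping of z with the MSB cleared.
        return ~epistasis(z - msb, eta) & ((1 << eta) - 1)
    result = 0
    for i in range(eta):
        skipped = (i - 1) % eta  # the single input locus left out for output bit i
        bit = 0
        for j in range(eta):
            if j != skipped:
                bit ^= (z >> j) & 1  # exclusive or over all remaining input bits
        result |= bit << i
    return result
```

For η = 4, flipping any single bit of z changes at least three bits of the result, and the sixteen possible inputs map to sixteen different outputs.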

In order to introduce this model of epistasis in genotypes of arbitrary length, we divide
them into blocks of the length η and transform each of them separately with e_η. If the length
of a given genotype g is no multiple of η, the remaining len(g) mod η bits at the end will
be transformed with the function e_{len(g) mod η} instead of e_η, as outlined in Figure 21.6. It
may also be an interesting fact that the e_η transformations are a special case of the NK
landscape discussed in Section 21.2.1 with N = η and K ≈ η − 2.



  z     e_4(z)      z     e_4(z)      z     e_4(z)
  0000  0000        1111  1110        0011  0110
  0001  1101        0111  0001        0101  1010
  0010  1011        1011  1001        0110  1100
  0100  0111        1101  0101        1001  0010
  1000  1111        1110  0011        1010  0100
                                      1100  1000

Figure 21.7: An example for the epistasis mapping z → e_4(z).



The tunable parameter η for the epistasis ranges from 2 to n * m, the product of the
basic problem length n and the number of objectives m (see next section). If it is set to a
value smaller than 3, no additional epistasis is introduced. Figure 21.7 outlines the mapping
for η = 4.

Multi-Objectivity

A multi-objective problem with m criteria can easily be created by interleaving m instances
of the benchmark problem with each other and introducing separate objective functions
for each of them. Instead of just dividing the genotype g in m blocks, each standing for
one objective, we scatter the objectives as illustrated in Figure 21.6. The bits for the first
objective function comprise x_1 = (g[0], g[m], g[2m], ...), those used by the second objective
function are x_2 = (g[1], g[m+1], g[2m+1], ...). Notice that no bit in g is used by more than one
objective function. Superfluous bits (beyond index nm − 1) are ignored. If g is too short,
the missing bits in the phenotypes are replaced with the complement from x*, i.e., if one
objective misses the last bit (index n − 1), it is padded by the complement of x*[n−1], which will
worsen the objective by 1 on average.
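The interleaving can be sketched as shown below. We assume bit strings over '0' and '1' and a target string x* of length n; the function name split_objectives and its parameter names are our own.

```python
from typing import List


def split_objectives(g: str, n: int, m: int, target: str) -> List[str]:
    """Scatter the genotype g over m phenotypes of length n each.

    Phenotype k (0-based) receives g[k], g[m + k], g[2m + k], ...
    Bits beyond index n*m - 1 are ignored; bits missing because g is too
    short are filled with the complement of the target bit at that index.
    """
    phenotypes = []
    for k in range(m):
        bits = []
        for i in range(n):
            idx = i * m + k
            if idx < len(g):
                bits.append(g[idx])
            else:
                bits.append("1" if target[i] == "0" else "0")
        phenotypes.append("".join(bits))
    return phenotypes
```

For example, with n = 4 and m = 2, the genotype "01100110" yields the phenotypes "0101" and "1010".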

Because of the interleaving, the objectives will begin to conflict if epistasis (η > 2) is
applied, similar to NK landscapes. Changing one bit in the genotype will change the outcome
of at most min{η, m} objectives. Some of them may improve while others may worsen.

A non-functional objective function minimizing the length of the genotypes is added if 
variable-length genomes are used during the evolution. If fixed-length genomes are used, 
they can be designed in a way that the blocks for the single objectives have always the right 
length. 

Ruggedness

In an optimization problem, there can be at least two (possibly interacting) sources of
ruggedness of the fitness landscape. The first one, epistasis, has already been modeled and
discussed. The other source concerns the objective functions themselves, the nature of the
problem. We will introduce this type of ruggedness a posteriori by artificially lowering the
causality of the problem space. We therefore shuffle the objective values with a permutation
r : [0, f̂] → [0, f̂], where f̂ is the abbreviation for the maximum possible objective value, as
defined in Equation 21.32.

Before we do that, let us briefly outline what makes a function rugged. Ruggedness
is obviously the opposite of smoothness and causality. In a smooth objective function, the
objective values of the solution candidates neighboring in problem space are also neighboring.
In our original problem with o = 0, ε = 0, and tc = 1, for instance, two individuals differing
in one bit will also differ by one in their objective values. We can write down the list of
objective values the solution candidates will take on when they are stepwise improved from
the worst to the best possible configuration as (f̂, f̂ − 1, ..., 2, 1, 0). If we exchange two of the
values in this list, we will create some artificial ruggedness. A measure for the ruggedness of
such a permutation r is Δ(r):

\Delta(r) = \sum_{i=0}^{\hat{f}-1} \left| r[i] - r[i+1] \right|    (21.35)

The original sequence of objective values has the minimum value Δ_min = f̂, and the maximum
possible value is Δ_max = f̂(f̂ + 1)/2. There exists at least one permutation for each Δ value in
Δ_min..Δ_max. We can hence define the permutation r_γ which is applied after the objective values
are computed and which has the following features:

1. It is bijective (since it is a permutation).

2. It must preserve the optimal value, i.e., r_γ[0] = 0.

3. Δ(r_γ) = Δ_min + γ.

With γ ∈ [0, Δ_max − Δ_min], we can fine-tune the ruggedness. For γ = 0, no ruggedness
is introduced. For a given f̂, we can compute the permutations r_γ with the procedure
buildRPermutation(γ, f̂) defined in Algorithm 21.2.
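The ruggedness measure of Equation 21.35 is a one-liner when the permutation is stored as a list; the function name ruggedness is our own choice for this sketch.

```python
from typing import Sequence


def ruggedness(r: Sequence[int]) -> int:
    """Sum of the absolute differences of neighboring entries of r (Equation 21.35)."""
    return sum(abs(r[i] - r[i + 1]) for i in range(len(r) - 1))
```

For f̂ = 5, the identity (0, 1, 2, 3, 4, 5) yields the minimum 5, whereas the zigzag sequence (0, 5, 1, 4, 2, 3) reaches the maximum 5 * 6 / 2 = 15.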




  γ     Δ(r_γ)   character
  0     5        identity
  1     6        rugged
  2     7        rugged
  3     8        rugged
  4     9        rugged
  5     10       deceptive
  6     11       deceptive
  7     12       deceptive
  8     13       rugged
  9     14       rugged
  10    15       rugged

Figure 21.8: An example for r_γ with γ = 0..10 and f̂ = 5.






Algorithm 21.2: r_γ ← buildRPermutation(γ, f̂)

Input: γ: the γ value
Input: f̂: the maximum objective value
Data: i, j, d, tmp: temporary variables
Data: k, start, r: parameters of the subalgorithm
Output: r_γ: the permutation r_γ

begin
  Subalgorithm r ← permutate(k, r, start)
  begin
    if k > 0 then
      if k ≤ (f̂ − start) then
        r ← permutate(k − 1, r, start)
        tmp ← r[f̂]
        r[f̂] ← r[f̂ − k]
        r[f̂ − k] ← tmp
      else
        i ← ⌊(start + 1)/2⌋
        if (start mod 2) = 1 then
          i ← f̂ + 1 − i
          d ← −1
        else
          d ← 1
        for j ← start up to f̂ do
          r[j] ← i
          i ← i + d
        r ← permutate(k − f̂ + start, r, start + 1)
    return r
  end

  r ← (0, 1, 2, ..., f̂ − 1, f̂)
  return permutate(γ, r, 1)
end



Figure 21.8 outlines all ruggedness permutations r_γ for an objective function which can
range from 0 to f̂ = 5. As can be seen, the permutations scramble the objective function
more and more with rising γ and reduce its gradient information.
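The construction of r_γ can be sketched iteratively in Python. This is one reading of Algorithm 21.2, not a definitive transcription: we assume that the swap phase at recursion level start handles at most f̂ − start steps and that the tail is rewritten in descending order at odd levels and in ascending order at even levels. Under these assumptions, the sketch yields Δ(r_γ) = f̂ + γ, which matches the Δ values of Figure 21.8 for f̂ = 5. The function name build_r_permutation is hypothetical.

```python
from typing import List


def build_r_permutation(gamma: int, f_max: int) -> List[int]:
    """Return a permutation r of 0..f_max with r[0] == 0 whose neighbor
    differences sum up to f_max + gamma (under the assumptions stated above)."""
    if not 0 <= gamma <= f_max * (f_max - 1) // 2:
        raise ValueError("gamma must lie in 0..f_max*(f_max-1)/2")
    r = list(range(f_max + 1))
    k, start = gamma, 1
    while k > 0:
        if k <= f_max - start:
            # Swap phase: k successive swaps with the last element.
            for step in range(1, k + 1):
                r[f_max], r[f_max - step] = r[f_max - step], r[f_max]
            k = 0
        else:
            # Fill phase: rewrite r[start..f_max] as a monotone run.
            i = (start + 1) // 2
            if start % 2 == 1:
                i, d = f_max + 1 - i, -1  # odd level: descending run
            else:
                d = 1                     # even level: ascending run
            for j in range(start, f_max + 1):
                r[j] = i
                i += d
            k -= f_max - start
            start += 1
    return r
```

For f̂ = 5, γ = 1 gives (0, 1, 2, 3, 5, 4) and γ = 10 gives the zigzag sequence (0, 5, 1, 4, 2, 3).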

Experimental Validation 

In this section, we will use a selection of the experimental results obtained with our model
in order to validate the correctness of the approach.¹⁰ Table 21.3 states the configuration
of the evolutionary algorithm used for our experiments. For each of the experiment-specific
settings discussed later, at least 50 runs have been performed.

Parameter                     Short  Description
Problem Space                 X      The variable-length bit strings consisting of between 1 and 8000 bits. (see Section 3.5)
Objective Functions           F      F = {f_{ε,o,tc}, f_nf}, where f_nf is the non-functional length criterion f_nf(x) = len(x). (see Equation 21.29)
Search Space                  G      G = X
Search Operations             Op     cr = 80% single-point crossover, mr = 20% single-bit mutation
GPM                           gpm    (see Section 21.2.7)
Optimization Algorithm        alg    plain genetic algorithm (see Chapter 3)
Comparison Operator           cm     Pareto comparison (see Section 1.2.2)
Population Size               ps     ps = 1000
Steady-State                  ss     The algorithms were generational (not steady-state) (ss = 0). (see Section 2.1.6)
Fitness Assignment Algorithm  fa     For fitness assignment in the evolutionary algorithm, Pareto ranking was used. (see Section 2.3.3)
Selection Algorithm           sel    A tournament selection with tournament size k = 5 was applied. (see Section 2.4.4)
Convergence Prevention        cp     No additional means for convergence prevention were used, i.e., cp = 0. (see Section 2.4.8)
Generation Limit              mxt    The maximum number of generations that each run is allowed to perform. (see Definition 1.43) mxt = 1001

Table 21.3: The settings of the experiments with the benchmark model.

10 More experimental results and more elaborate discussions can be found in the bachelor's thesis
of Niemczyk [1533].

Basic Complexity

In the experiments, we distinguish between success and perfection. Success means finding
individuals x of optimal functional fitness, i.e., f_{ε,o,tc}(x) = 0. Multiple such successful strings
may exist, since superfluous bits at the end of genotypes do not influence their functional
objective. We will refer to the number of generations needed to find a successful individual
as success generations. The perfect string x* has no useless bits; it is the shortest possible
solution with f_{ε,o,tc} = 0 and, hence, also optimal in the non-functional length criterion. In
our experiments, we measure:



Measure                      Short      Description
Success Fraction             s/r        The fraction of experimental runs that turned out successful. (see Section 20.3.1)
Minimum Success Generation   st (min.)  The number of generations needed by the fastest (successful) experimental run to find a successful individual. (see Section 20.3.1)
Mean Success Generation      st (mean)  The average number of generations needed by the (successful) experimental runs to find a successful individual. (see Section 20.3.1)
Maximum Success Generation   st (max.)  The number of generations needed by the slowest (successful) experimental run to find a successful individual. (see Section 20.3.1)
Mean Perfection Generation   pt (mean)  The average number of generations needed by the (perfect) experimental runs to find a perfect individual.

Table 21.4: First-level evaluation results of the experiments with the model benchmark.

In Figure 21.9, we have computed the minimum, average, and maximum number of the
success generations st for values of n ranging from 8 to 800. As illustrated, the
problem hardness increases steadily with rising string length n. Trimming down the solution
strings to the perfect length becomes more and more complicated with growing n. This is
likely because the fraction at the end of the strings where the trimming is to be performed
shrinks in comparison with its overall length.

Figure 21.9: The basic problem hardness.

Ruggedness 



Figure 21.10: Experimental results for the ruggedness (average success generations st over γ at n = 80).

In Figure 21.10, we plotted the average success generations st with n = 80 and different
ruggedness settings γ. Interestingly, the gray original curve behaves very strangely. It is
divided into alternating solvable and unsolvable¹¹ problems. The unsolvable ranges of γ
correspond to gaps in the curve. With rising γ, the solvable problems require more and more
generations until they are solved. After a certain (earlier) γ threshold value, the unsolvable
sections become solvable. From there on, they become simpler with rising γ. At some point,
the two parts of the curve meet.



Algorithm 21.3: γ ← translate(γ′, f̂)

Input: γ′: the raw γ value
Input: f̂: the maximum value of f_{ε,o,tc}
Data: i, j, k, l: some temporary variables
Output: γ: the translated γ value

begin
  [...]
  if γ′ ≤ [...] then
    j ← [...]
    k ← [...]
    return k + 2(j(f̂ + 2) − [...] − f̂) − j
  else
    j ← [...]
    k ← γ′ − (j − (f̂ mod 2))(j − 1) − 1 − i
    return [...] − k − 2j² + j − (f̂ mod 2)(−2j + 1)
end



The reason for this behavior is rooted in the way that we construct the ruggedness
mapping r and illustrates the close relation between ruggedness and deceptiveness.
Algorithm 21.2 is a greedy algorithm which alternates between creating groups of mappings
that are mainly rugged and such that are mainly deceptive. In Figure 21.8, for instance, from
γ = 5 to γ = 7, the permutations exhibit a high degree of deceptiveness whilst just being
rugged before and after that range. Thus, it seems to be a good idea to rearrange these
sections of the ruggedness mapping. The identity mapping should still come first, followed
by the purely rugged mappings ordered by their Δ-values. Then, the permutations should
gradually change from rugged to deceptive and the last mapping should be the most
deceptive one (γ = 10 in Figure 21.8). The black curve in Figure 21.10 depicts the results of
rearranging the γ-values with Algorithm 21.3. This algorithm maps deceptive gaps to higher
γ-values and, by doing so, makes the resulting curve continuous.¹²

Fig. 21.11.a sketches the average success generations for the rearranged ruggedness
problem for multiple values of n and γ′. Depending on the basic problem size n, the problem
hardness increases steeply with rising values of γ′.

In Algorithm 21.2 and Algorithm 21.3, we use the maximum value of the functional
objective function (abbreviated with f̂) in order to build and to rearrange the ruggedness
permutations r. Since this value depends on the basic problem length n, the number of
different permutations and thus the range of the γ′ values will too. The length of the lines
in direction of the γ′ axis in Fig. 21.11.a thus increases with n. We introduce two
additional scaling functions for ruggedness and deceptiveness with a parameter g spanning from
zero to ten, regardless of n. Only one of these functions can be used at a time,
depending on whether experiments should be run for rugged (Equation 21.36) or for deceptive
(Equation 21.37) problems. For scaling, we use the highest γ′ value which maps to rugged
mappings, γ′_r = ⌈0.5 f̂⌉ * ⌊0.5 f̂⌋, and the minimum and maximum ruggedness values
according to Equation 21.35.

\text{rugged: } \gamma' = \mathrm{round}(0.1 \, g \cdot \gamma'_r)    (21.36)

\text{deceptive: } \gamma' = \begin{cases} 0 & \text{if } g \le 0 \\ \gamma'_r + \mathrm{round}\left(0.1 \, g \cdot (\Delta_{max} - \Delta_{min} - \gamma'_r)\right) & \text{otherwise} \end{cases}    (21.37)

11 We call a problem unsolvable if it has not been solved within 1000 generations.

12 This is a deviation from our original idea, but this idea did not consider deceptiveness.

Fig. 21.11.a: Unscaled ruggedness st plot.
Fig. 21.11.b: Scaled ruggedness st plot.
Fig. 21.11.c: Scaled deceptiveness, average success generations st.
Fig. 21.11.d: Scaled deceptiveness, failed runs.

Figure 21.11: Experiments with ruggedness and deceptiveness.

When using this scaling mechanism, the curves resulting from experiments with different
n-values can be compared more easily: Fig. 21.11.b, based on the scale from Equation 21.36,
for instance, shows much more clearly how the problem difficulty rises with increasing ruggedness
than Fig. 21.11.a. We can also spot some irregularities which always occur at about the
same degree of ruggedness, near g ≈ 9.5, and which we will investigate in the future.

The experiments with the deceptiveness scale of Equation 21.37 show the tremendous effect
of deceptiveness in the fitness landscape. Not only does the problem hardness rise steeply
with g (Fig. 21.11.c); after a certain threshold, the evolutionary algorithm becomes unable to
solve the model problem at all (in 1000 generations), and the fraction of failed experiments
in Fig. 21.11.d jumps to 100% (since the fraction s/r of solved ones goes to zero).

Epistasis 




Fig. 21.12.a: Epistasis η and problem length n.
Fig. 21.12.b: Epistasis η and problem length n: failed runs.

Figure 21.12: Experiments with epistasis.



Fig. 21.12.a illustrates the relation between problem size n, the epistasis factor η, and
the average success generations. Although rising epistasis makes the problems harder, the
complexity does not rise as smoothly as in the previous experiments. The cause for this is
likely the presence of crossover: if only mutation were allowed, the impact of epistasis would
most likely be more intense. Another interesting fact is that experimental settings with odd
values of η tend to be much more complex than those with even ones. This relation becomes
even more obvious in Fig. 21.12.b, where the proportion of failed runs, i.e., those which were
not able to solve the problem in less than 1000 generations, is plotted. A high plateau for greater
values of η is cut by deep valleys at positions where η = 2 + 2i ∀i ∈ N. This phenomenon has
thoroughly been discussed by Niemczyk [1533] and can be excluded from the experiments
by applying the scaling mechanism with parameter y ∈ [0, 10] as defined in Equation 21.38:

\text{epistasis: } \eta = \begin{cases} 2 & \text{if } y \le 0 \\ 41 & \text{if } y \ge 10 \\ 2 \lfloor 2y \rfloor + 1 & \text{otherwise} \end{cases}    (21.38)
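The epistasis scale can be written down directly. The sketch below assumes that the boundary cases are inclusive (y ≤ 0 and y ≥ 10); the function name scale_epistasis is our own.

```python
import math


def scale_epistasis(y: float) -> int:
    """Translate the scale parameter y in [0, 10] into an epistasis value eta.

    Inside the open interval only odd values are produced, which skips the
    "easy" even settings of eta.
    """
    if y <= 0:
        return 2   # eta < 3: no additional epistasis
    if y >= 10:
        return 41
    return 2 * math.floor(2 * y) + 1
```

For instance, y = 5 maps to η = 21 and y = 2.3 maps to η = 9.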

Neutrality 

Figure 21.13 illustrates the average number of generations st needed to grow an individual
with optimal functional fitness for different values of the neutrality parameter μ. Until
μ ≈ 10, the problem hardness increases rapidly. For larger degrees of redundancy, only
minor increments in st can be observed. The reason for this strange behavior seems to be the
crossover operation. Niemczyk [1533] shows that a lower crossover rate makes experiments
involving the neutrality filter of the model problem very hard. We recommend using only
μ-values in the range from zero to eleven for testing the capability of optimization methods
of dealing with neutrality.

Figure 21.13: The results from experiments with neutrality.



Epistasis and Neutrality 

Our model problem consists of independent filters for the properties that may influence the
hardness of an optimization task. It is especially interesting to find out whether these filters
can be combined arbitrarily, i.e., if they are indeed free of interaction. In the ideal case, st
of an experiment with n = 80, μ = 8, and η = 0 added to st of an experiment for n = 80,
μ = 0, and η = 4 should roughly equal st of an experiment with n = 80, μ = 8, and
η = 4. In Figure 21.14, we have sketched these expected values (Fig. 21.14.a) and the results
of the corresponding real experiments (Fig. 21.14.b). In fact, these two diagrams are very
similar. The small valleys caused by the "easier" values of η (see Section 21.2.7) occur in
both charts. The only difference is a stronger influence of the degree of neutrality.

Fig. 21.14.a: The expected results.
Fig. 21.14.b: The experimentally obtained results.

Figure 21.14: Expectation and reality: Experiments involving both epistasis and neutrality.



Ruggedness and Epistasis 

It is a well-known fact that epistasis leads to ruggedness, since it violates the causality as
discussed in Section 1.4.6. Combining the ruggedness and the epistasis filter therefore leads
to stronger interactions. In Fig. 21.15.b, the influence of ruggedness seems to be amplified
by the presence of epistasis when compared with the estimated results shown in Fig. 21.15.a.
Apart from this increase in problem hardness, the model problem behaves as expected. The
characteristic valleys stemming from the epistasis filter are clearly visible, for instance.

Fig. 21.15.a: The expected results.
Fig. 21.15.b: The experimentally obtained results.

Figure 21.15: Expectation and reality: Experiments involving both ruggedness and epistasis.



Summary 

In summary, this model problem has proven to be a viable approach for simulating prob- 
lematic phenomena in optimization. It is 

1. functional, i. e., allows us to simulate many problematic features, 

2. tunable - each filter can be tuned independently, 

3. easy to understand, 

4. allows for very fast fitness computation, 

5. easily extensible - each filter can be replaced with other approaches for simulating the 
same feature. 

Niemczyk [1533] has written a stand-alone Java class implementing the model problem
which is provided at http://www.sigoa.org/documents/ [accessed 2008-05-17] and
http://www.it-weise.de/documents/files/TunableModel.java [accessed 2008-05-17]. This class allows
setting the parameters discussed in this section and provides methods for determining the
objective values of individuals in the form of byte arrays. In the future, some strange
behaviors (like the irregularities in the ruggedness filter and the gaps in epistasis) of the model
need to be revisited, explained, and, if possible, removed.

21.3 Genetic Programming Problems 
21.3.1 Artificial Ant 

We already have discussed parts of the Artificial Ant problem in Section 1.2.2 on page 27 - 
here we are going to investigate it more thoroughly. The goal of the original problem defined 
by Collins and Jefferson [431, 1046, 433] was to find a program that controls an artificial 
ant in a simulated environment. Such environments usually have the following features: 

1. It is divided into a toroidal grid of rectangular cells in the plane, which makes the
positions of all objects discrete.

2. There exists exactly one ant in the environment. 

3. The ant will always be inside one cell at one time. 

4. A cell can either contain one piece of food or not. 

The ant is a very simple life form. It always faces in one of the four directions north, east,
south, or west. Furthermore, it can sense if there is food in the next cell in the direction it 
faces. It cannot sense if there is food on any other cell in the map. 

Like space, the time in the Artificial Ant problem is also discrete. The ant may carry out 
one of the following actions per time unit: 

1. The ant can move for exactly one cell into the direction it faces. If this cell contains 
food, the ant consumes it in the very moment in which it enters the cell. 

2. The ant may turn left or right by 90°.

3. The ant may do nothing in a time unit. 

Many researchers such as Collins and Jefferson [431, 1046, 433], Koza [1196], Lee and
Wong [1268], Harries and Smith [900], Luke and Spector [1322], Kuscu [1226], Chellapilla
[384], Ito et al. [1024], Langdon and Poli [1240], and Frey [749] have since used the
Artificial Ant problem as a benchmark in their research. Since the Artificial Ant problem neither
imposes a special genome or phenome nor otherwise restricts the parameters of the
optimization process, it is an ideal environment for such tests. In order to make the benchmark
results comparable, special instances of the problem like the Santa Fe Trail with well-defined
features have been defined.

Santa Fe trail 

One instance of the Artificial Ant problem is the "Santa Fe trail" sketched in Figure 21.16,
designed by Langton [1196]. It is a map of 32*32 cells containing 89 food pellets distributed
along a certain route. Initially, the ant will be placed in the upper left corner of the field
facing east. In the trail of food pellets, there are gaps of five forms:

1. one cell along a straight line

2. two cells along a straight line 

3. one cell in a corner 

4. two cells at a corner (requiring something like a "horse jump" in chess) 

5. three cells at a corner 

The goal is here to find some form of control for the ant that allows it to eat as many of 
the food pellets as possible (the maximum is 89) and to walk a distance as short as possible 
in order to do so (the optimal route is illustrated in Figure 21.16). Of course, there will be 
a time limit set for the ant to perform this task (normally 200 time units). 

Solutions 

Genetic Algorithm evolving Finite State Machines 

Jefferson et al. [1046] applied a conventional genetic algorithm that evolved finite state
machines encoded in a fixed-length binary string genome to the Artificial Ant problem. The
sensor information together with the current state determines the next state; therefore, a
finite state machine with at most m states can be encoded in a chromosome using 2m genes.
In order to understand the structure of such a chromosome, let us assume that m = 2^n. We
then can specify the finite state machine as a table where n + 1 bits are used as row index.
n of these bits identify the current state and one bit is used for the sensor information
(1 = food ahead, 0 = no food ahead). In total, there are 2m rows. There is no need to store
the row indices, just the cell contents: n bits encode the next state, and two bits encode the
action to be performed at the state transition (00 for nothing, 01 for turning left, 10 for
turning right, 11 for moving). A finite state machine with m states
can thus be encoded in 2m(n + 2) = 2^(n+1) (n + 2) bits. If the initial state in the chromosome is
also to be stored, another n bits are needed to do so. Every chromosome represents a valid
finite state machine.
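The length of such a chromosome follows directly from this layout. The small helper below is our own illustration (the name fsm_chromosome_bits is hypothetical); it reproduces the 453 bits reported for machines with 32 = 2^5 states.

```python
def fsm_chromosome_bits(n: int, store_initial_state: bool = True) -> int:
    """Number of bits needed for a finite state machine with m = 2**n states.

    There are 2m = 2**(n + 1) table rows (state x sensor bit), and each row
    stores n bits for the next state plus 2 bits for the action.
    """
    rows = 2 ** (n + 1)
    bits_per_row = n + 2
    total = rows * bits_per_row
    if store_initial_state:
        total += n  # n additional bits for the initial state
    return total
```

For n = 5, the table itself takes 64 * 7 = 448 bits, and the initial state adds another 5 bits.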

Figure 21.16: The Santa Fe Trail in the Artificial Ant Problem [1196]



Jefferson et al. [1046] allowed for 32 states (453 bit chromosomes) in their finite state 
machines. They used one objective function that returned the number of food pellets eaten 
by the ant in a simulation run (of maximal 200 steps) and made it subject to maximization. 
Using a population of 65 536 individuals, they found one optimal solution (with fitness 89). 

Genetic Algorithm evolving Artificial Neural Networks 

Collins and Jefferson [431] also evolved an artificial neural network (encoded in a 520 bit 
genome) with a genetic algorithm of the same population size to successfully solve the 
Artificial Ant problem. Later, they applied a similar approach [433] with 25 590 bit genomes 
which allowed even the structure of the artificial neural networks to evolve to a generalized 
problem exploring the overall behavior of ant colonies from a more biological perspective. 

Genetic Programming evolving Control Programs 

Koza [1189, 1188, 1187] solved the Artificial Ant problem by evolving LISP¹³ programs.
To this end, he introduced the parameterless instructions MOVE, RIGHT, and LEFT that moved
the ant one unit, or turned it right or left, respectively. Furthermore, the binary conditional
expression IF-FOOD-AHEAD executed its first parameter expression if the ant could sense food
and the second one otherwise. Two compound instructions, PROGN2 and PROGN3, execute their
two or three sub-expressions unconditionally. After 21 generations using a 500 individual
population and fitness-proportional selection, Genetic Programming yielded an individual
solving the Santa Fe trail optimally. Connolly [435] provides an easy step-by-step guide
for solving the Artificial Ant problem with Genetic Programming (and for visualizing this
process).
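To make the semantics of these instructions concrete, the following sketch interprets such programs on a small toroidal grid. It is not Koza's implementation: programs are represented as nested Python tuples, each MOVE, LEFT, or RIGHT costs one time unit, the program is restarted until the time runs out, and the names Ant and evaluate as well as the toy food layouts are our own assumptions.

```python
from typing import Set, Tuple

# Headings in clockwise order: east, south, west, north (rows grow downward).
DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


class Ant:
    def __init__(self, food: Set[Tuple[int, int]], size: int, max_steps: int):
        self.food = set(food)
        self.size = size
        self.steps_left = max_steps
        self.row, self.col, self.heading = 0, 0, 0  # upper left corner, facing east
        self.eaten = 0

    def _ahead(self) -> Tuple[int, int]:
        dr, dc = DIRS[self.heading]
        return ((self.row + dr) % self.size, (self.col + dc) % self.size)

    def run(self, program) -> None:
        if self.steps_left <= 0:
            return
        op = program[0] if isinstance(program, tuple) else program
        if op == "MOVE":
            self.steps_left -= 1
            self.row, self.col = self._ahead()
            if (self.row, self.col) in self.food:
                self.food.remove((self.row, self.col))
                self.eaten += 1
        elif op == "LEFT":
            self.steps_left -= 1
            self.heading = (self.heading - 1) % 4
        elif op == "RIGHT":
            self.steps_left -= 1
            self.heading = (self.heading + 1) % 4
        elif op == "IF-FOOD-AHEAD":
            self.run(program[1] if self._ahead() in self.food else program[2])
        elif op in ("PROGN2", "PROGN3"):
            for sub in program[1:]:
                self.run(sub)
        else:
            raise ValueError(f"unknown instruction: {op!r}")


def evaluate(program, food, size: int = 32, max_steps: int = 200) -> int:
    """Run the program repeatedly and return the number of pellets eaten."""
    ant = Ant(food, size, max_steps)
    while ant.steps_left > 0 and ant.food:
        before = ant.steps_left
        ant.run(program)
        if ant.steps_left == before:  # the program consumed no time: stop
            break
    return ant.eaten
```

The one-line program ("IF-FOOD-AHEAD", "MOVE", "RIGHT") already follows a trail without gaps, including right-hand corners; the gaps of the Santa Fe trail are what makes the real problem hard.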

13 http://en.wikipedia.org/wiki/Lisp_%28programming_language%29 [accessed 2007-07-03]

21.3.2 The Greatest Common Divisor

Another problem suitable to test Genetic Programming approaches is to evolve an algorithm
that computes the greatest common divisor¹⁴, the GCD.

Problem Definition

Definition 21.1 (GCD). For two integer numbers a, b ∈ N_0, the greatest common
divisor (GCD) is the largest number c ∈ N that divides both a (c | a ≡ a mod c = 0) and b
(c | b ≡ b mod c = 0).

c = \gcd(a, b) \Leftrightarrow c \mid a \wedge c \mid b \wedge (\nexists d \in \mathbb{N} : d \mid a \wedge d \mid b \wedge d > c)    (21.39)
\gcd(a, b) = \max \{ e \in \mathbb{N} : (a \bmod e = 0) \wedge (b \bmod e = 0) \}    (21.40)

The Euclidean Algorithm 

The GCD can be computed with the Euclidean algorithm¹⁵, which is specified in its original
version in Algorithm 21.4 and in the improved, faster variant as Algorithm 21.5 [1559, 913].



Algorithm 21.4: gcd(a, b) ← euclidGcdOrig(a, b)

Input: a, b ∈ N_0: two integers
Output: gcd(a, b): the greatest common divisor of a and b

begin
  while b ≠ 0 do
    if a > b then a ← a − b
    else b ← b − a
  return a
end

Algorithm 21.5: gcd(a, b) ← euclidGcd(a, b)

Input: a, b ∈ N_0: two integers
Data: t: a temporary variable
Output: gcd(a, b): the greatest common divisor of a and b

begin
  while b ≠ 0 do
    t ← b
    b ← a mod b
    a ← t
  return a
end
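Both variants translate almost literally into Python. In the sketch below, the function names are our own, and the early return for a = 0 in the subtraction-based version is an addition of this sketch: without it, the loop would never terminate for a = 0 and b > 0, since b − a = b.

```python
def euclid_gcd_orig(a: int, b: int) -> int:
    """Subtraction-based Euclidean algorithm for non-negative integers."""
    if a < 0 or b < 0:
        raise ValueError("a and b must be non-negative")
    if a == 0:
        return b  # guard added in this sketch, see the note above
    while b != 0:
        if a > b:
            a -= b
        else:
            b -= a
    return a


def euclid_gcd(a: int, b: int) -> int:
    """Modulo-based Euclidean algorithm for non-negative integers."""
    if a < 0 or b < 0:
        raise ValueError("a and b must be non-negative")
    while b != 0:
        a, b = b, a % b
    return a
```

Both functions return 5 for the pair (20, 15); the modulo-based variant needs far fewer loop iterations when a and b differ strongly in size.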



The Objective Functions and the Prevalence Comparator 

Although the GCD problem seems to be more or less trivial since simple algorithms exist
that solve it, it has characteristics that make it hard for Genetic Programming. Assume we
have evolved a program x ∈ X which takes the two values a and b as input parameters and
returns a new value c = x(a, b). Unlike in symbolic regression¹⁶, it makes no sense to define
the error between c and the real value gcd(a, b) as objective function, since there is no relation
between the "degree of correctness" of the algorithm and |c − gcd(a, b)|. As a matter of fact, we
cannot say that a program returning c_1 = x_1(20, 15) = 6 is better than one returning
c_2 = x_2(20, 15) = 10. 6 may be closer to the real result gcd(20, 15) = 5 but shares no divisor
with it, whereas 5 | 10 ≡ 10 mod 5 = 0.

14 http://en.wikipedia.org/wiki/Greatest_common_divisor [accessed 2007-10-05]

15 http://en.wikipedia.org/wiki/Euclidean_algorithm [accessed 2007-10-05]

Based on the idea that the GCD of the variables a and b is preserved in each step
of the Euclidean algorithm, a suitable functional objective function f_1 : X → [0, 5] for this
problem is Algorithm 21.6. It takes a training case (a, b) as argument and first checks whether
the evolved program x ∈ X returns the correct result for it. If so, f_1(x) = 0 is returned.
Otherwise, we check if the greatest common divisor of x(a, b) and a is still the greatest
common divisor of a and b. If this is not the case, 1 is added to the objective value. The
same is repeated with x(a, b) and b. Furthermore, negative values of x(a, b) are penalized
with 2 and results that are larger or equal to a or b are penalized with one additional point
for each violation. Depending on the program representation, this objective function is very
rugged because small changes in the program have a large potential impact on the fitness. It
also exhibits a large amount of neutrality, since it can take on only integer values between
0 (the optimum) and 5 (the worst case).



Algorithm 21.6: r ← f1(x, a, b)

Input: a, b ∈ ℕ0: the training case
Input: x ∈ X: the evolved algorithm to be evaluated
Data: v: a variable holding the result of x for the training case
Output: r: the functional objective value of the functional objective function f1 for the
        training case

begin
  r ← 0
  v ← x(a, b)
  if v ≠ gcd(a, b) then
    r ← r + 1
    if gcd(v, a) ≠ gcd(a, b) then r ← r + 1
    if gcd(v, b) ≠ gcd(a, b) then r ← r + 1
    if v < 0 then
      r ← r + 2
    else
      if v ≥ a then r ← r + 1
      if v ≥ b then r ← r + 1
  return r
end
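The objective function described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the name `f1` and the representation of an evolved program as a plain Python callable are hypothetical, the penalties are assumed to apply only when the result is wrong (so that a correct program scores 0), and "larger than or equal" is taken from the prose description.

```python
from math import gcd


def f1(program, a, b):
    """Functional objective value for one training case (0 = best, 5 = worst).

    `program` is assumed to be any callable taking (a, b) and returning an int.
    """
    target = gcd(a, b)
    v = program(a, b)
    if v == target:
        return 0  # correct result: no penalty at all

    r = 1  # the result is wrong
    # math.gcd works on absolute values, so negative v is handled here too.
    if gcd(v, a) != target:
        r += 1  # the GCD is not preserved with respect to a
    if gcd(v, b) != target:
        r += 1  # the GCD is not preserved with respect to b
    if v < 0:
        r += 2  # negative results are penalized with 2
    else:
        if v >= a:
            r += 1  # result not smaller than a
        if v >= b:
            r += 1  # result not smaller than b
    return r
```

For the training case (20, 15), a program returning 10 receives a better (smaller) value than one returning 6, which mirrors the divisor argument made in the text.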



In addition to f1, two objective functions optimizing non-functional aspects should be
present: f2(x) minimizes the number of expressions in x, and f3(x) minimizes the number
of steps x needs until it terminates and returns the result. This way, we favor small and
fast algorithms. These three objective functions, combined into a prevalence^17 comparator
cmp_F,gcd, can serve as a benchmark for how well a Genetic Programming approach can
cope with the rugged fitness landscape common to the evolution of real algorithms and how
the parameter settings of the evolutionary algorithm influence this ability.



16 See Section 23.1 for an extensive discussion of symbolic regression.
17 See Definition 1.17 on page 39 for more information.



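\begin{equation}
\mathrm{cmp}_{F,\mathrm{gcd}}(x_1, x_2) =
\begin{cases}
-1 & \text{if } f_1(x_1) < f_1(x_2) \\
1 & \text{if } f_1(x_1) > f_1(x_2) \\
\mathrm{cmp}_{F,\mathrm{Pareto}}(x_1, x_2) \text{ on } \{f_2, f_3\} & \text{otherwise}
\end{cases}
\tag{21.41}
\end{equation}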
In principle, Equation 21.41 gives the functional fitness precedence before any other
objective. If (and only if) the functional objective values of both individuals are equal,
the prevalence is decided upon a Pareto comparison of the remaining two (non-functional)
objectives.
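The comparison just described can be sketched as follows. The function names and the encoding of an individual as a tuple (f1, f2, f3) are hypothetical, and the sign convention (a negative result means that the first individual prevails) is an assumption of this sketch.

```python
def cmp_pareto(u, v):
    """Pareto comparison of two objective vectors (all minimized).

    Returns -1 if u dominates v, 1 if v dominates u, and 0 otherwise.
    """
    u_better = any(x < y for x, y in zip(u, v))
    v_better = any(y < x for x, y in zip(u, v))
    if u_better and not v_better:
        return -1
    if v_better and not u_better:
        return 1
    return 0


def cmp_gcd(x1, x2):
    """Prevalence comparison of two individuals given as (f1, f2, f3).

    The functional objective f1 decides first; only on a tie in f1 are the
    non-functional objectives f2 and f3 compared in the Pareto sense.
    """
    if x1[0] < x2[0]:
        return -1
    if x1[0] > x2[0]:
        return 1
    return cmp_pareto(x1[1:], x2[1:])
```

With this ordering, a large and slow but more correct program always prevails over a small and fast but less correct one.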

The Training Cases 

The structure of the training cases is also very important. If we simply use two random
numbers a and b, their greatest common divisor is likely to be 1 or 2. Hence, we construct
a single training case by first drawing a random number r ∈ ℕ uniformly distributed in
[10, 100 000] as lower limit for the GCD and then keep drawing uniformly distributed random
numbers a ≥ r, b ≥ r until gcd(a, b) ≥ r. Furthermore, if multiple training cases are involved
in the individual evaluation, we ensure that they involve different magnitudes of the values
of a, b, and r. If we change the training cases after each generation of the evolutionary
algorithm, the same goes for two subsequent training case sets. Some typical training sets
are noted in Listing 21.2.
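A minimal sketch of this construction is given below. It is not the original implementation: the function name, the parameter names, and in particular the upper bound `max_value` for a and b (which the text does not state) are assumptions made for illustration.

```python
import random
from math import gcd


def make_training_case(rng, r_min=10, r_max=100_000, max_value=2**31 - 1):
    """Draw one training case (a, b, gcd(a, b)) by rejection sampling.

    r is the lower limit for the GCD; a and b are redrawn until their
    greatest common divisor reaches that limit. `max_value` is an assumed
    upper bound for a and b.
    """
    r = rng.randint(r_min, r_max)
    while True:
        a = rng.randint(r, max_value)
        b = rng.randint(r, max_value)
        g = gcd(a, b)
        if g >= r:
            return a, b, g


# Usage: a seeded generator makes the drawn cases reproducible.
rng = random.Random(42)
a, b, g = make_training_case(rng, r_max=30, max_value=10_000)
```

Plain rejection sampling needs many draws when r is large, so a practical generator might instead multiply a suitable common factor into two random numbers; the sketch only mirrors the description in the text.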



Generation 0
a             b             gcd(a, b)
(illegible)   (illegible)   21627
(illegible)   (illegible)   (illegible)
2008942236    579260484     972
556527320     1588840       144440
14328736      10746552      3582184
1390          268760        10
929436304     860551        5153
941094        1690414110    1386
14044248      1259211564    53604

Generation 1
a             b             gcd(a, b)
117140        1194828       23428
2367          42080         263
3236545       379925        65
1796284190    979395390     10
4760          152346030     10
12037362      708102        186
1785869184    2093777664    61581696
782331        42435530      23707
434150199     24453828      63
45509100      7316463       35007

Generation 2
a             b             gcd(a, b)
1749281113    82            41
25328611      99            11
279351072     2028016224    3627936
173078655     214140        53535
216           126           18
1607646156    583719700     2836
1059261       638524299     21
70903440      1035432       5256
26576383      19043         139
1349426       596258        31382

Listing 21.2: Some training cases for the GCD problem.



Rule-based Genetic Programming 

We have conducted a rather large series of experiments on solving the GCD problem with 
Rule-based Genetic Programming (RBGP, see Section 4.8.4 on page 207). In this section, 
we will elaborate on the different parameters that we have tried out and what results could 
be observed for these settings. 

Settings 

As outlined in Table 21.5, we have tried to solve the GCD problem with Rule-based Genetic
Programming using many different settings (60 in total) in a factorial experiment. We will
discuss these settings here in accordance with Section 20.1.



Parameter | Short | Description

Problem Space | X | The space of Rule-based Genetic Programming programs with between 2 and 100 rules. (see Section 4.8.4)
Objective Functions | F | F = {f1, f2, f3} (see Section 21.3.2)
Search Space | G | The variable-length bit strings with a gene size of 34 bits and a length between 64 and 3400 bits.
Search Operations | Op | mr = 30% mutation (including single-bit flips, permutations, gene deletion and insertion), cr = 70% multi-point crossover
GPM | gpm | (see Figure 4.29)
Optimization Algorithm | alg | The optimization algorithm applied. alg = 0 → evolutionary algorithm; alg = 1 → Parallel Random Walks
Comparison Operator | cm | (see Equation 21.41)
Population Size | ps | ps ∈ {512, 1024, 2048}
Steady-State | ss | The evolutionary algorithms were either steady-state (ss = 1), meaning that the offspring had to compete with the already existing individuals in the selection phase, or generational/extinctive (ss = 0), meaning that only the offspring took part in the selection and the parents were discarded. (see Section 2.1.6) ss = 0 → generational; ss = 1 → steady-state
Fitness Assignment Algorithm | fa | For fitness assignment in the evolutionary algorithm, Pareto ranking was used. (see Section 2.3.3)
Selection Algorithm | sel | A binary (k = 2) tournament selection was applied in the evolutionary algorithm. (see Section 2.4.4)
Convergence Prevention | cp | (see Section 2.4.8) cp ∈ {0, 0.3}
Number of Training Cases | tc | The number of training cases used for evaluating the objective functions. tc ∈ {1, 10}
Training Case Change Policy | ct | The policy according to which the training cases are changed. ct = 0 → the training cases do not change; ct = 1 → the training cases change each generation.
Generation Limit | mxt | The maximum number of generations that each run is allowed to perform. (see Definition 1.43) mxt = 501
System Configuration | Cfg | Normal off-the-shelf PCs with approximately 2 GHz processor power

Table 21.5: The settings of the RBGP-Genetic Programming experiments for the GCD
problem.

Convergence Prevention In our past experiments, we have observed that Genetic
Programming in rugged fitness landscapes, and Genetic Programming of real algorithms
(which usually leads to rugged fitness landscapes), is very inclined to converge prematurely.
If it finds some half-baked solution, the population often tends to converge to this individual
and the evolution stops.

There are many ways to prevent this, for example by modifying the fitness assignment process
with sharing functions (see Section 2.3.4 on page 114). Such methods
influence individuals close in objective space and decrease their chance to reproduce. Here,
we decided to choose a very simple measure which only decreases the probability of reproduction
of individuals with exactly equal objective values: the simple convergence prevention
algorithm SCP introduced in Section 2.4.8. This filter has either been applied with strength
cp = 0.3 or not been used (cp = 0).

Comparison with Random Walks We found it necessary to compare the Genetic Programming
approach for solving this problem with random walks in order to find out whether or not
Genetic Programming can provide any advantage in a rugged fitness landscape. Therefore,
we either used an evolutionary algorithm with the parameters discussed above (alg = 0) or
parallel random walks (alg = 1). Random walks, in this context, are essentially evolutionary
algorithms where neither fitness assignment nor selection is performed. Hence, we can test
parameters like ps, ct, and tc, but no convergence prevention (cp = 0) and also no steady
state (ss = 0). The results of these random walks are the best individuals
encountered during their course.

Results 

We have determined the following parameters from the data obtained with our experiments. 
Measure | Short | Description

Perfection Fraction | p/r | The fraction of experimental runs that found perfect individuals. This fraction is also the estimate of the cumulative probability of finding a perfect individual until the 500th generation. (see Section 20.3.1) p/r = CP_p(ps, 500) (see Equation 20.20)
Number of Perfect Runs | #p | The number of runs where perfect individuals were discovered. (see Section 20.3.1)
Number of Successful Runs | #s | The number of runs where successful individuals were discovered. (see Section 20.3.1)
Number of Comp. Runs | #r | The total number of completed runs with the specified configuration.
Mean Success Generation | st | The average number of generations t needed by the (successful) experimental runs to find a successful individual (or ∅ if no run was successful). (see Section 20.3.1)
Runs Needed for Perfection | pt_n | The estimated number pt_n(0.99, ps, 500) of independent runs needed to find at least one perfect solution candidate with a probability of z = 0.99 until the 500th generation. (see Equation 20.21)
Evaluations Needed for Perfection | pτ_n | The estimated number pτ_n(0.99, ps, 500) of objective function evaluations needed to find at least one perfect solution candidate with a probability of z = 0.99 until the 500th generation. (see Section 20.3.2)

Table 21.6: Evaluation parameters used in Table 21.7.
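The relation between these measures can be illustrated with a short sketch. Equation 20.21 is not reproduced in this section, so the formula below is an assumption: it is the usual estimate n = log(1 − z) / log(1 − p) for the number of independent runs needed to succeed at least once with probability z. The function names are hypothetical, and the number of evaluations per run is passed in by the caller.

```python
from math import inf, log


def perfection_fraction(n_perfect, n_runs):
    """Fraction p/r of completed runs that found a perfect individual."""
    return n_perfect / n_runs


def runs_needed(p, z=0.99):
    """Estimated number of independent runs needed to find at least one
    perfect individual with probability z, given the per-run probability p."""
    if p <= 0.0:
        return inf   # no perfect run observed: no finite estimate
    if p >= 1.0:
        return 1.0   # every run is perfect
    return log(1.0 - z) / log(1.0 - p)


def evaluations_needed(p, evaluations_per_run, z=0.99):
    """Estimated number of objective function evaluations consumed by the
    runs counted in runs_needed()."""
    return runs_needed(p, z) * evaluations_per_run


# Usage: 15 perfect runs out of 53, one training case, ps = 1024,
# 500 generations (assumed here to cost 1024 * 500 evaluations per run).
p = perfection_fraction(15, 53)
n = runs_needed(p)
e = evaluations_needed(p, 1024 * 500)
```

With 15 perfect runs out of 53, the sketch gives about 13.84 runs, which agrees with the corresponding entry of Table 21.7.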

In the context of this experiment, a perfect solution represents a correct GCD algorithm,
i.e., one that is not overfitted. Solutions with optimal functional objective values (f1 = 0, whether due
to overfitting or not) are called successful. Overfitted programs, like the one illustrated in
Listing 21.4, will not work with inputs a and b different from those used in their evaluation.

Not all configurations were tested with the same number of runs since we had multiple 
computers involved in these test series and needed to end it at some point of time. We then 
used the maximum amount of available data for our evaluation. With the given configura- 
tions, the evolutionary algorithm runs usually took about one to ten minutes (depending on 
the population size). The results of the application of the Rule-based Genetic Programming 
to the GCD problem are listed in Table 21.7 below. 



rank  alg  cp   ss  ct  tc  ps    p/r   #p  #s  #r  st     pt_n    pτ_n
 1.   0    0.3  1   0   1   1024  0.28  15  45  53  100.4  13.84   7 086 884
 2.   0    0.3  1   0   1   512   0.12  6   35  51  98.5   36.79   9 419 095
 3.   1    0    0   0   1   512   0.10  5   27  51  259.1  44.63   11 425 423
 4.   0    0.3  1   0   10  2048  0.98  48  48  49  70.0   1.18    12 116 937
 5.   1    0    0   0   1   1024  0.17  9   41  54  170.0  25.26   12 932 355
 6.   0    0.3  0   0   1   2048  0.27  14  49  51  85.2   14.35   14 694 861
 7.   0    0.3  1   0   10  1024  0.78  41  41  53  129.1  3.1     15 873 640
 8.   0    0.3  1   0   1   2048  0.25  13  51  51  36.4   15.65   16 026 722
 9.   0    0.3  1   0   10  512   0.49  25  25  51  153.0  6.84    17 498 481
10.   0    0.3  0   0   1   512   0.06  3   22  51  162.1  75.96   19 446 283
11.   0    0.3  0   1   10  1024  0.67  37  37  54  199.4  3.98    20 400 648
12.   0    0.3  0   1   1   1024  0.10  5   5   52  197.4  45.55   23 322 826
13.   0    0.3  1   1   10  2048  0.98  53  53  54  61.1   1.15    23 643 586
14.   0    0.3  0   0   1   1024  0.09  5   44  54  138.2  47.4    24 266 737
15.   0    0.3  0   0   10  2048  0.79  39  39  49  111.1  2.9     29 672 727
16.   0    0.3  0   1   10  2048  0.79  41  41  52  101.1  2.96    30 358 251
17.   0    0.3  1   1   10  1024  0.78  42  42  54  125.5  3.06    31 352 737
18.   0    0.3  1   1   1   2048  0.28  14  14  54  107.8  15.35   31 427 005
19.   0    0.3  0   0   10  1024  0.52  27  27  53  196.3  6.47    33 106 746
20.   0    0.3  0   1   10  512   0.25  13  13  52  231.5  16.01   40 980 085
21.   0    0.3  1   1   10  512   0.41  21  21  50  231.5  8.45    43 284 918
22.   0    0.3  0   1   1   2048  0.09  5   5   53  46.6   46.47   47 589 578
23.   0    0.3  0   0   10  512   0.19  10  10  52  250.8  21.56   55 199 744
24.   0    0.3  1   1   1   512   0.04  2   2   49  102.0  110.51  56 580 143
25.   0    0.3  1   1   1   1024  0.06  3   3   52  116.0  77.5    79 357 503
26.   0    0    0   0   10  1024  0.15  8   8   55  263    29.3    150 004 032
27.   0    0    1   0   10  1024  0.13  7   7   53  272.3  32.51   166 455 244
28.   0    0    0   0   10  2048  0.24  12  12  49  280.6  16.39   167 876 619
29.   0    0    1   0   1   2048  0.02  1   18  51  245.5  232.55  238 134 779
30.   1    0    0   0   1   2048  0.02  1   50  54  120.9  246.37  252 282 298
31.   0    0    1   0   10  2048  0.16  8   8   49  249.9  25.84   264 557 703
32.   0    0    1   1   10  512   0.08  4   4   50  320.3  55.23   282 777 841
33.   0    0    0   1   10  1024  0.06  3   3   53  264.3  79.03   404 649 274
34.   0    0    0   1   10  2048  0.10  5   5   50  237.4  43.71   447 576 992
35.   0    0    0   0   10  512   0.00  1   1   52  492.0  237.16  607 126 560
36.   0    0    1   1   10  2048  0.13  7   7   52  250.9  31.85   652 324 553
37.   0    0    1   1   10  1024  0.03  2   2   54  328.5  122.02  1 249 510 675
38.   1    0    0   1   1   2048  0.00  0   2   53  101.5  +∞      +∞
39.   0    0    0   0   1   2048  0.00  0   16  54  146.2  +∞      +∞
40.   0    0    1   0   1   512   0.00  0   6   51  202.0  +∞      +∞
41.   1    0    0   1   1   1024  0.00  0   2   53  209.0  +∞      +∞
42.   0    0    0   0   1   1024  0.00  0   9   54  257.1  +∞      +∞
43.   0    0    1   0   1   1024  0.00  0   16  54  277.3  +∞      +∞
44.   0    0    0   0   1   512   0.00  0   4   50  369.5  +∞      +∞
45.   0    0    0   1   1   1024  0.00  0   0   53  ∅      +∞      +∞
46.   0    0    0   1   1   2048  0.00  0   0   53  ∅      +∞      +∞
47.   0    0    0   1   1   512   0.00  0   0   51  ∅      +∞      +∞
48.   0    0    0   1   10  512   0.00  0   0   51  ∅      +∞      +∞
49.   0    0    1   0   10  512   0.00  0   0   52  ∅      +∞      +∞
50.   0    0    1   1   1   1024  0.00  0   0   52  ∅      +∞      +∞
51.   0    0    1   1   1   2048  0.00  0   0   54  ∅      +∞      +∞
52.   0    0    1   1   1   512   0.00  0   0   49  ∅      +∞      +∞
53.   0    0.3  0   1   1   512   0.00  0   0   52  ∅      +∞      +∞
54.   1    0    0   0   10  1024  0.00  0   0   55  ∅      +∞      +∞
55.   1    0    0   0   10  2048  0.00  0   0   49  ∅      +∞      +∞
56.   1    0    0   0   10  512   0.00  0   0   52  ∅      +∞      +∞
57.   1    0    0   1   1   512   0.00  0   0   51  ∅      +∞      +∞
58.   1    0    0   1   10  1024  0.00  0   0   53  ∅      +∞      +∞
59.   1    0    0   1   10  2048  0.00  0   0   51  ∅      +∞      +∞
60.   1    0    0   1   10  512   0.00  0   0   51  ∅      +∞      +∞

Table 21.7: Results of the RBGP test series on the GCD problem.

Each of the sixty rows of this table denotes one single test series. The first column
contains the rank of the series according to pτ_n. The following six columns specify the
settings of the test series as defined in Table 21.5 on page 361. The last seven
columns contain the evaluation results, which are described in Table 21.6.

Figure 21.17 illustrates the relation between the functional objective value f1 of the
currently best individual of the runs and the generation for the twelve best test series (according
to their p/r-values). The curves are monotone for series with constant training sets (ct = 0)
and jagged for those where the training data changed each generation (ct = 1).
false ∨ true ⇒ b_{t+1} = b_t % a_t
(b_t < a_t) ∨ false ⇒ a_{t+1} = a_t % b_t
false ∨ true ⇒ c_{t+1} = b_t

Listing 21.3: The RBGP version of the Euclidean algorithm.



(a_t < b_t) ∧ true ⇒ start_{t+1} = 1 − start_t
false ∨ (start_t > a_t) ⇒ start_{t+1} = start_t * 0
(a_t = 1) ∧ (0 > start_t) ⇒ start_{t+1} = start_t / c_t
true ∧ (c_t = start_t) ⇒ c_{t+1} = c_t + 1
(c_t > 0) ∨ (a_t < b_t) ⇒ a_{t+1} = a_t * start_t
true ∧ true ⇒ c_{t+1} = c_t − c_t
false ∨ (a_t ≠ start_t) ⇒ start_{t+1} = start_t − start_t
true ∨ (c_t = start_t) ⇒ c_{t+1} = c_t + 1
false ∨ (0 < start_t) ⇒ b_{t+1} = b_t * c_t
(start_t = c_t) ∨ (1 > start_t) ⇒ b_{t+1} = b_t % 1
(0 < 1) ∧ (0 > 0) ⇒ a_{t+1} = a_t / c_t
false ∨ (b_t < 0) ⇒ a_{t+1} = 1 − a_t
(start_t < start_t) ∨ true ⇒ c_{t+1} = c_t / 0
(a_t = start_t) ∧ true ⇒ c_{t+1} = c_t + 0
(a_t < b_t) ∧ true ⇒ start_{t+1} = 1 − start_t

Listing 21.4: An overfitted RBGP solution to the GCD problem.
[Figure 21.17: twelve plots of f1 over the generations (horizontal axis up to generation 500), one per configuration]

Fig. 21.17.a: alg=0, cp=0.3, ss=1, ct=1, tc=10, ps=2048
Fig. 21.17.b: alg=0, cp=0.3, ss=1, ct=0, tc=10, ps=2048
Fig. 21.17.c: alg=0, cp=0.3, ss=0, ct=1, tc=10, ps=2048
Fig. 21.17.d: alg=0, cp=0.3, ss=0, ct=0, tc=10, ps=2048
Fig. 21.17.e: alg=0, cp=0.3, ss=1, ct=1, tc=10, ps=1024
Fig. 21.17.f: alg=0, cp=0.3, ss=1, ct=0, tc=10, ps=1024
Fig. 21.17.g: alg=0, cp=0.3, ss=0, ct=1, tc=10, ps=1024
Fig. 21.17.h: alg=0, cp=0.3, ss=0, ct=0, tc=10, ps=1024
Fig. 21.17.i: alg=0, cp=0.3, ss=1, ct=0, tc=10, ps=512
Fig. 21.17.j: alg=0, cp=0.3, ss=1, ct=1, tc=10, ps=512
Fig. 21.17.k: alg=0, cp=0.3, ss=1, ct=1, tc=1, ps=2048
Fig. 21.17.l: alg=0, cp=0.3, ss=1, ct=0, tc=1, ps=1024

Figure 21.17: The f1/generation-plots of the best configurations.
We have sorted the runs in Table 21.7 according to their pτ_n-values, i.e., the (estimate of the)
number of individual evaluations that are consumed by the pt_n independent experimental
runs needed to find a perfect individual with z = 99% probability. Interestingly, our first
approach was to evaluate these experiments according to their p/r-ratio, which led to a
different order of the elements in the table.

Population Size The role of the population size ps is not obvious, and there is no clear
tendency which one to prefer when considering the pτ_n-value only. Amongst the three best
EAs according to this metric, we can find all three tested population sizes. When we focus
on the p/r-ratio instead, the four best runs all have a population size of 2048. At least in this
experiment, the rule holds that the bigger the population, the bigger the chance of success of a run.
The significance of this tendency is shown in Table 21.8, Table 21.9, and Table 21.10.
Of course, this comes with the trade-off that more individuals need to be processed, which
worsens pτ_n. If we perform multiple runs with smaller populations, we seemingly have a
higher chance of finding at least one non-overfitted program with fewer objective function
evaluations, but this trend could not be supported by the tests in the three tables.
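To make the kind of comparison used in these tables more concrete, the following sketch computes a two-sided sign test for paired samples. It is a generic textbook variant written for illustration, with a hypothetical function name; the exact procedure of Section 28.8.1 is not reproduced here and may differ in details such as the handling of ties.

```python
from math import comb


def sign_test(xs, ys):
    """Two-sided sign test for paired samples.

    Returns the probability of observing a split of positive and negative
    differences at least this uneven if both signs were equally likely.
    Pairs with equal values (ties) are ignored.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0  # no differences at all: nothing speaks against equality
    positive = sum(1 for d in diffs if d > 0)
    tail = min(positive, n - positive)
    # Probability of the smaller tail under Binomial(n, 0.5), doubled.
    p = 2.0 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)
```

Each pair would consist of the p/r (or pτ_n) values of two test series that differ only in the parameter under study, for example ps = 1024 versus ps = 512 with all other settings equal.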



ps = 1024 vs. ps = 512 (based on 19 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|ps=1024 = 0.19, med(p/r)|ps=512 = 0.09; α ≈ 0.0063 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|ps=1024 = 0.06, mean(p/r)|ps=512 = 0.0; α ≈ 0.0024 ⇒ significant at level α = 0.05
  Signed rank test (see Section 28.8.1): R(p/r)|ps:1024−512 = 114.0; α ≈ 0.019 ⇒ significant at level α = 0.05

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|ps=1024 = 1.66·10^8, med(pτ_n)|ps=512 = +∞; α ≈ 0.1940 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|ps=1024 = +∞, mean(pτ_n)|ps=512 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|ps:1024−512 = −94.0; α ≈ 0.0601 ⇒ not significant at level α = 0.05

Table 21.8: ps = 1024 vs. ps = 512 (based on 19 samples)






ps = 2048 vs. ps = 512 (based on 20 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|ps=2048 = 0.115, med(p/r)|ps=512 = 0.0; α ≈ 0.0017 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|ps=2048 ≈ 0.255, mean(p/r)|ps=512 = 0.087; α ≈ 0.0006 ⇒ significant at level α = 0.05
  Signed rank test (see Section 28.8.1): R(p/r)|ps:2048−512 = 150.0; α ≈ 0.0034 ⇒ significant at level α = 0.05

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|ps=2048 = 2.452·10^8, med(pτ_n)|ps=512 = +∞; α ≈ 0.293 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|ps=2048 = +∞, mean(pτ_n)|ps=512 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|ps:2048−512 = −134.0; α ≈ 0.0104 ⇒ significant at level α = 0.05

Table 21.9: ps = 2048 vs. ps = 512 (based on 20 samples)

ps = 1024 vs. ps = 2048 (based on 19 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|ps=1024 = 0.06, med(p/r)|ps=2048 = 0.1; α ≈ 0.002 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|ps=1024 = 0.186, mean(p/r)|ps=2048 = 0.255; α ≈ 0.011 ⇒ significant at level α = 0.05
  Signed rank test (see Section 28.8.1): R(p/r)|ps:1024−2048 = −148.0; α ≈ 0.002 ⇒ significant at level α = 0.05

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|ps=1024 = 1.66·10^8, med(pτ_n)|ps=2048 = 2.523·10^8; α ≈ 0.1597 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|ps=1024 = +∞, mean(pτ_n)|ps=2048 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|ps:1024−2048 = −22.0; α ≈ 0.6794 ⇒ not significant at level α = 0.05

Table 21.10: ps = 1024 vs. ps = 2048 (based on 19 samples)



Steady State In many experimental runs, a configuration with ss = 1, i.e., steady-state, was
better than exactly the same configuration with ss = 0 (compare, for instance, ranks 1
and 14 or ranks 2 and 10 in Table 21.7). Also, if we look at the four best runs according to the
p/r-rate, we can see that the better two of them both have ss = 1 while the other two have
ss = 0, with all other parameters remaining constant. In Table 21.11, these tendencies are
reflected in the mean, median, and rank values but are not fully supported with sufficient
evidence in the hypothesis tests.
ss = 1 vs. ss = 0 (based on 23 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|ss=1 = 0.12, med(p/r)|ss=0 = 0.09; α ≈ 0.053 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|ss=1 = 0.249, mean(p/r)|ss=0 ≈ 0.186; α ≈ 0.0076 ⇒ significant at level α = 0.05
  Signed rank test (see Section 28.8.1): R(p/r)|ss:1−0 = 125.0; α ≈ 0.057 ⇒ not significant at level α = 0.05

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|ss=1 = 1.665·10^8, med(pτ_n)|ss=0 = 1.679·10^8; α ≈ 0.405 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|ss=1 = +∞, mean(pτ_n)|ss=0 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|ss:1−0 ≈ −27.0; α ≈ 0.692 ⇒ not significant at level α = 0.05

Table 21.11: ss = 1 vs. ss = 0 (based on 23 samples)



Convergence Prevention The influence of our primitive convergence prevention mechanism
is remarkable: the top 15 test series according to p/r all have cp = 0.3, and even
generational tests with a population size of 512 beat steady-state runs with a population of
2048 individuals if using convergence prevention. Considering the estimated number pτ_n of
individuals that need to be evaluated in pt_n independent runs to achieve 99% probability of
finding a non-overfitted solution, this trend is even more obvious: all of the 23 best Genetic
Programming approaches had the convergence prevention mechanism turned on. To be more
precise: all but one single configuration with convergence prevention were better than all EA
configurations with convergence prevention turned off. This trend is fully supported by the
hypothesis tests from Table 21.12 for both p/r and pτ_n. It seems that keeping the evolutionary
process going and not allowing a single program to spread unchanged throughout the
population increases the solution quality a lot.
cp = 0.3 vs. cp = 0 (based on 23 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|cp=0.3 = 0.27, med(p/r)|cp=0 = 0.0; α ≈ 0 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|cp=0.3 = 0.391, mean(p/r)|cp=0 = 0.048; α ≈ 0 ⇒ significant at level α = 0.05
  Signed rank test (see Section 28.8.1): R(p/r)|cp:0.3−0 ≈ 274.0; α ≈ 0 ⇒ significant at level α = 0.05

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|cp=0.3 = 2.96·10^7, med(pτ_n)|cp=0 = +∞; α ≈ 0 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|cp=0.3 = +∞, mean(pτ_n)|cp=0 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|cp:0.3−0 = −276.0; α ≈ 0 ⇒ significant at level α = 0.05

Table 21.12: cp = 0.3 vs. cp = 0 (based on 23 samples)



Number of Training Cases According to the pτ_n measure, using one training case (tc = 1)
is sometimes better than using tc = 10. Then, fewer individual evaluations are needed
for finding a non-overfitted individual if fewer training cases are used. Obviously, using
ten training cases corresponds to ten times as many individual evaluations per generation.
When comparing rows 1 and 7 in Table 21.7, the difference in estimated evaluations needed is
only a factor of approximately two, and the configuration of row 23 needs approximately three times as
many evaluations as row 10. The median in Table 21.13 points in the other direction:
because of many zero values for p/r with one training case, these test series perform worse.
The only applicable test, the sign test, supports that ten training cases are better than one.

If we consider the fraction p/r of experiments that led to a perfect individual compared
to the total number of experiments run for a configuration, this effect becomes even more
obvious. The number of training cases has a very drastic effect: the top ten test series
are all based on ten training cases (tc = 10). Table 21.13 clearly emphasizes the significance
of this tendency.

We can think of a very simple reason for this, which can be observed very well when
comparing, for example, Fig. 21.17.l with Fig. 21.17.i. In the best series based on only a single
training case (tc = 1), illustrated in Fig. 21.17.l, only six values (0..5) for the objective
function f1 could occur. The ninth best series, depicted in Fig. 21.17.i, on the other hand, had
a much broader set of values of f1 available. Since tc = 10 training cases were used and the
final objective value assigned to an individual is the average of the scores reached in all these
tests, f1 could vary in much finer steps, with 51 = |{0.0, 0.1, 0.2, ..., 4.8, 4.9, 5.0}| levels. By
using multiple training cases for these runs, we have effectively reduced the ruggedness of the
fitness landscape and made it easier for the evolutionary algorithm to descend a gradient.
The effect of increasing the resolution of the objective functions by increasing the number
of training cases has also been reported by other researchers such as Lasarczyk and Banzhaf
[1258] in the area of Algorithmic Chemistries^18.
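The number of levels mentioned above follows directly from averaging integer scores, which the short sketch below enumerates. The function name is hypothetical; the only assumption is that each training case contributes an integer score between 0 and 5 and that the objective value is their arithmetic mean.

```python
from fractions import Fraction


def objective_levels(tc, max_score=5):
    """All distinct values the mean of tc integer scores in 0..max_score
    can take. The sum of the scores can be any integer in 0..tc*max_score."""
    return sorted({Fraction(total, tc) for total in range(tc * max_score + 1)})


# Usage: one training case yields 6 levels, ten training cases yield 51.
levels_one = objective_levels(1)
levels_ten = objective_levels(10)
```

In general, tc training cases give 5·tc + 1 levels, so the step between neighboring objective values shrinks to 1/tc.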

What we see is that a higher number of training cases decreases overfitting and increases 
the chance of a run to find good solutions. It does, however, not decrease the expected 
number of individuals to be processed until a good solution is found. 



18 You can find Algorithmic Chemistries discussed in Section 4.8.2 on page 205.
tc = 1 vs. tc = 10 (based on 29 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|tc=1 = 0.0, med(p/r)|tc=10 = 0.13; α ≈ 0.0004 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|tc=1 = 0.058, mean(p/r)|tc=10 = 0.273; could not be applied
  Signed rank test (see Section 28.8.1): R(p/r)|tc:1−10 = −362.0; α ≈ 0.00002 ⇒ significant at level α = 0.05

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|tc=1 = +∞, med(pτ_n)|tc=10 = 2.646·10^8; α ≈ 0.0241 ⇒ significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|tc=1 = +∞, mean(pτ_n)|tc=10 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|tc:1−10 = 111.0; could not be applied

Table 21.13: tc = 1 vs. tc = 10 (based on 29 samples)



Changing Training Cases In this experiment, the EAs with constant training cases (ct = 0)
seemingly outperform those with training cases that change each generation (ct = 1)
according to the pτ_n metric. This is strange, since one would expect that changing the training cases would
reduce overfitting and thus, since it does not a priori require more evaluations, improve
pτ_n. Still, only one of the tests from Table 21.14 supports the significance of this result. The
average first success generation st remains roughly constant, regardless of whether the training data
changes or not.

The best ten series according to st all use ten training cases (tc = 10), which seems to
prevent overfitting sufficiently on its own. There is a difference for tc = 1, though, when
we compare the perfect runs with those which were just successful. In all runs that find a
solution x ∈ X with f1(x) = 0, this solution is also correct if ct = 1, i.e., #p = #s. In the test
series where ct = 0, usually only a fraction of the runs that found an individual with optimal
functional fitness had indeed found a correct solution. Here, overfitting takes place and #p < #s can
usually be observed.

In the context of this experiment, the parameter ct has no substantial influence on the
chance of finding a solution to the GCD problem in a run. Using training cases that change
each generation even has a negative influence on the pτ_n values. Maybe the proportion
of possible programs that are truly correct, compared to those that just perform well when
applied to the training cases due to overfitting, is relatively high in this problem. If so, the
influence of this parameter could be different in other scenarios.
ct = 0 vs. ct = 1 (based on 29 samples)

Test according to p/r (higher is better)
  Sign test (see Section 28.8.1): med(p/r)|ct=0 = 0.1, med(p/r)|ct=1 = 0.03; α ≈ 0.053 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(p/r)|ct=0 = 0.191, mean(p/r)|ct=1 = 0.165; could not be applied
  Signed rank test (see Section 28.8.1): R(p/r)|ct:0−1 = 86.0; not significant

Test according to pτ_n (lower is better)
  Sign test (see Section 28.8.1): med(pτ_n)|ct=0 = 1.665·10^8, med(pτ_n)|ct=1 = 1.25·10^9; α ≈ 0.458 ⇒ not significant at level α = 0.05
  Randomization test (see Section 28.8.1): mean(pτ_n)|ct=0 = +∞, mean(pτ_n)|ct=1 = +∞; could not be applied
  Signed rank test (see Section 28.8.1): R(pτ_n)|ct:0−1 = −313.0; α ≈ 0.0003 ⇒ significant at level α = 0.05

Table 21.14: ct = 0 vs. ct = 1 (based on 29 samples)



Comparison to Random Walks: According to the chance p/r that a test series finds a
non-overfitted solution, the best 17 configurations were all evolutionary algorithms, and apart
from the 18th and 26th best series, no random walk made it into the top 30. Strangely,
two random walks obtain very good placements (third and fifth rank) when considering the
pr_n metric, but then the next best random walk resides on rank 38. The two good random
walks are configured in a way that leads to few evaluations, which leads to good values of
pr_n when a good solution is found by accident. They are also the reason why only one of
the tests in Table 21.15 is really significant. Nevertheless, having two random walks in such
high placements could either mean that the GCD problem is very hard (so searching with
an EA is not really better than with a random walk) or very simple (since randomly created
programs can solve it in many cases).

Be that as it may, the dominance of Genetic Programming in most of the measurements
and evaluation results of this problem indicates that there is a benefit of using EAs. One of
the reasons for many of the bad performances of the random walks was that the individuals
tended to become unreasonably large. This also increased the amount of time needed for
evaluation. However, at least sometimes it seems to be a good idea to also try some runs
which utilize the brute force of random walks when trying to solve a GP problem.
alg = 0 vs. alg = 1 (based on 12 samples)

Test according to p/r (higher is better)

Sign test: med(p/r)|alg=0 = 0.0, med(p/r)|alg=1 = 0.0,
(see Section 28.8.1) could not be applied

Randomization test: mean(p/r)|alg=0 = 0.046, mean(p/r)|alg=1 = 0.024,
(see Section 28.8.1) α ≈ 0.5 ⇒ not significant at level α = 0.05

Signed rank test: R(p/r)|alg=0−1 = −3.0,
(see Section 28.8.1) α ≈ 0.925 ⇒ not significant at level α = 0.05

Test according to pr_n (lower is better)

Sign test: med(pr_n)|alg=0 = +∞, med(pr_n)|alg=1 = +∞,
(see Section 28.8.1) α ≈ 0.774 ⇒ not significant at level α = 0.05

Randomization test: mean(pr_n)|alg=0 = +∞, mean(pr_n)|alg=1 = +∞,
(see Section 28.8.1) could not be applied

Signed rank test: R(pr_n)|alg=0−1 = −27.0,
(see Section 28.8.1) α ≈ 0.289 ⇒ not significant at level α = 0.05

Table 21.15: alg = 0 vs. alg = 1 (based on 12 samples)
Contests



For most of the problems that can be solved with the aid of computers, a multitude of
different approaches exist. They are often comparably good, and their utility in single
application cases strongly depends on parameter settings and thus on the experience of the user.
Contests provide a stage where students, scientists, and practitioners from industry can
demonstrate their solutions to specific problems. They not only indicate which
techniques are suitable for which tasks, but also provide incentives and spark scientific
interest to improve and extend them. The RoboCup 1, for example, is known to be the origin of
many new, advanced techniques in robotics, image processing, cooperative behavior,
multivariate data fusion, and motion control [114, 780, 115, 1102] 2. In this chapter, we discuss
Genetic Programming approaches to competitions like the Data-Mining-Cup or the Web
Service Challenge.

22.1 Data-Mining-Cup

22.1.1 Introduction 
Data Mining 

Definition 22.1 (Data Mining). Data mining 3 can be defined as the nontrivial extraction 
of implicit, previously unknown, and potentially useful information from data [743] and the 
science of extracting useful information from large data sets or databases [885]. 

Today, gigantic amounts of data are collected in the web, in medical databases, by
enterprise resource planning (ERP) and customer relationship management (CRM) systems
in corporations, in web shops, by administrative and governmental bodies, and in science
projects. These data sets are far too large to be incorporated directly into a decision making
process or to be understood as-is by a human being. Instead, automated approaches have
to be applied that extract the relevant information, find underlying rules and patterns,
or detect time-dependent changes. Data mining subsumes the methods and techniques
capable of performing this task. It is very closely related to estimation theory in stochastics
(discussed in Section 28.7 on page 499) - the simplest summary of a data set is still the
arithmetic mean of its elements. Data mining is also strongly related to artificial intelligence
[1780, 569], which includes learning algorithms that can generalize the given information.
Some of the most widespread and most common data mining techniques are:

1 http://www.robocup.org/ [accessed 2007-07-03] and http://en.wikipedia.org/wiki/Robocup
[accessed 2007-07-03]

2 Big up to the Carpe Noctem Robotic Soccer Team founded by my ingenious colleagues Baer and 
Reichle [114] (http://carpenoctem.das-lab.net/ [accessed 2008-04-23])! 

3 http://en.wikipedia.org/wiki/Data_mining [accessed 2007-07-03] 
1. artificial neural networks (ANN) [207, 210],

2. support vector machines (SVM) [2107, 2150, 306, 2092], 

3. logistic regression [16], 

4. decision trees [186, 2243], 

5. Learning Classifier Systems as introduced in Chapter 7 on page 233, and 

6. naive Bayes Classifiers [578, 1741]. 

The Data-Mining-Cup

The Data-Mining-Cup 4 (DMC) was established in the year 2000 by the prudsys AG 5
and the Technical University of Chemnitz 6. It aims to provide an independent platform for
data mining users and data analysis tool vendors and builds a bridge between academic
science and economy. Today, it is one of Europe's biggest and most influential conferences
in the area of data mining.

The Data-Mining-Cup Contest is the biggest international student data mining
competition. In the spring of each year, students of national and international universities compete
to find the best solution to a data analysis problem. Figure 22.1 shows the logos of the DMC
from 2005 to 2007 obtained from http://www.data-mining-cup.com/ [accessed 2007-07-03].




22.1.2 The 2007 Contest Using Classifier Systems 

In May 2007, the students Stefan Achler, Martin Gob, and Christian Voigtmann came into
my office and told me about the DMC. They knew that evolutionary algorithms are methods
for global optimization that can be applied to a wide variety of tasks and wondered if they
could be utilized for the DMC too. After some discussion about the problem to be solved, we
together came up with the following approach, which was then realized by them. While we
are going to talk about our basic ideas and the results of the experiments here, a detailed view
on the implementation issues using the Java Sigoa framework is given in Section 26.1
on page 445. We have also summarized our work for this contest in a technical report [2178].

4 The Data-Mining-Cup is a registered trademark of prudsys AG. Der Data-Mining-Cup ist 
eine eingetragene Marke der prudsys AG. http://www.data-mining-cup.com/ [accessed 2007-07-03], 
http://www.data-mining-cup.de/ [accessed 2007-07-03] 

5 http://www.prudsys.de/ [accessed 2007-07-03] 

6 http://www.tu-chemnitz.de [accessed 2007-07-03] (Germany) - By the way, that's the university I've 
studied at, a great place with an excellent computer science department. 
A Structured Approach to Data Mining

Whenever any sort of problem should be solved, a structured approach is always advisable. 
This goes for the application of optimization methods like evolutionary algorithms as well 
as for deriving classifiers in a data mining problem. In this section we discuss a few simple 
steps which should be valid for both kinds of tasks and which have been followed in our 
approach to the 2007 DMC. 

The first step is always to clearly specify the problem that should be solved. Parts of this 
specification are possible target values and optimization criteria as well as the semantics of 
the problem domain. The optimization criteria tell us how different possible solutions can be 
compared with each other. If we were to sell tomatoes, for example, the target value (subject 
to maximization) would be the profit. Then again, the semantics of the problem domain allow 
us to draw conclusions on what features are important in the optimization or data mining 
process. When selling tomatoes, for instance, the average weight of the vegetables, their 
color, and maybe the time of the day when we open the store are important. The names 
of our customers on the other hand are probably not. The task of the DMC 2007 Contest, 
outlined in Section 22.1.2, is a good example for such a problem definition. 

Before choosing or applying any data mining or optimization technique, an initial analysis
of the given data should be performed. With this review and the problem specification, we
can filter the data and maybe remove unnecessary features. Additionally, we will gain insight
into the data structure and hopefully can already eliminate some possible solution approaches.
It is, of course, better to exclude mining techniques that cannot lead to good results
in the initial phase instead of wasting working hours on trying them out to no avail. Finding
solutions with offline evolutionary computation usually takes a while, so we now have to
decide on one or two solution approaches that are especially promising for the problem
defined. We have performed this step for the DMC 2007 Contest data in Section 22.1.2 on
page 377.

After this, we can apply the selected approaches. Of course, running an optimizer on all
known sample data at once is not wise. Although we will obtain a result with which we can
solve the specified problem for all the known data samples, it is possibly not a good solution.
Instead, it may be overfitted and only be able to process the data we were given. Normally, however,
we will only be provided with a fraction of the "real data" and want to find a system that
also performs well on samples that are not yet known to us. Hence, we need to find
out whether or not our approach generalizes. Therefore, solutions are derived for a subset
of the available data samples only, the training data. These solutions are then tested on
the test set, the remaining samples not used in their creation. 7 The system we have created
generalizes well if it is rated approximately equally good by the optimization criterion for
both the training and the test data. Now we can repeat the process by using all available
data. We have evolved classifier systems that solve the DMC 2007 Contest according to this
method in Section 22.1.2 on page 379.
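
The training/test procedure described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the 20% test fraction, and the 10% tolerance are hypothetical choices and are not taken from the contest setup or from the Sigoa implementation.

```python
import random


def split_holdout(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (training, test) lists.

    test_fraction and seed are illustrative assumptions.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def generalizes(score_train, score_test, tolerance=0.1):
    """Return True if the per-sample test score deviates from the
    per-sample training score by at most `tolerance` (relative)."""
    if score_train == 0:
        return score_test == 0
    return abs(score_train - score_test) <= tolerance * abs(score_train)
```

The scores passed to generalizes should be normalized per sample (for example, profit divided by the number of samples), since the training and test sets differ in size.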

The students Achler, Gob, and Voigtmann have participated in the 2007 DMC Contest
and proceeded according to this pattern. In order to solve the challenge, they chose
a genetic algorithm evolving a fuzzy classifier system. The results of their participation
are discussed in Section 22.1.2 on page 382. The following sub-sections are based on their
experiences and impressions, and reproduce how they proceeded.

The Problem Definition 

Rebate systems are an important means to encourage customers to return to a store in classical
retail. In the 2007 contest, we consider a check-out couponing system. Whenever a customer
leaves a store, a coupon can be attached at the end of her bill. She can then use the coupon
to receive some rebate on her next purchase. When printing the bill at the checkout, there
are three options for couponing:

7 See also Section 1.4.8 for this approach. 
Case N: attach no coupon to the bill,

Case A: attach coupon type A, a general rebate coupon, to the bill, or 
Case B: attach coupon type B, a special voucher, to the bill. 

The profit of the couponing system is defined as follows: 

1. Each coupon which is not redeemed costs 1 money unit. 

2. For each redeemed coupon of type A, the retailer gains 3 money units. 

3. For each coupon of type B which is redeemed, the retailer gains 6 money units. 

It is thus clear that simply printing both coupons at the end of each bill makes no sense. 
In order to find a good strategy for coupon printing, the retailer has initiated a survey. 
She wants to find out which type of customer has an affinity to cash in coupons and, if 
so, which type of coupon most likely. Therefore the behavior of 50000 customers has been 
anonymously recorded. For all these customers, we know the customer ID, the number of 
redemptions of 20 different coupons and the historic information whether coupon type A, 
coupon type B, or none of them has been redeemed. Cases where both have been cashed in 
are omitted. 
Figure 22.2: A few samples from the DMC 2007 training data.



Figure 22.2 shows some samples from this data set. The task is to use it as training data
in order to derive a classifier C that is able to decide from a record of the 20 features whether
a coupon A, B, or none should be provided to a customer. This means maximizing the
profit P(C) that the retailer gains by using the classifier C, which can be computed according to

$P(C) = 3 \cdot AA + 6 \cdot BB - 1 \cdot (NA + NB + BA + AB)$ (22.1)

where 

1. AA is the number of correct assignments for coupon A.

2. BB is the number of correct assignments for coupon B. 

3. NA is the number of wrong assignments to class A from the real class N. 

4. NB is the number of wrong assignments to class B from the real class N. 

5. BA is the number of wrong assignments to class A from the real class B. 

6. AB is the number of wrong assignments to class B from the real class A. 

Wrong assignments from the classes A and B to N play no role. 
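
As a small illustration, Equation 22.1 can be evaluated directly from two parallel sequences of class labels. The function name and the use of the single-character labels 'A', 'B', and 'N' are assumptions made for this sketch.

```python
def profit(predicted, actual):
    """Profit P(C) according to Equation 22.1.

    predicted, actual: parallel sequences of the labels 'A', 'B', 'N'.
    """
    total = 0
    for p, a in zip(predicted, actual):
        if p == "N":
            continue  # no coupon printed: neither gain nor cost
        if p == a:
            total += 3 if p == "A" else 6  # redeemed coupon (AA or BB)
        else:
            total -= 1  # coupon printed but not redeemed (NA, NB, BA, AB)
    return total
```

For example, the predictions A, A, B, B, N, N for the real classes A, B, B, N, A, N yield 3 - 1 + 6 - 1 + 0 + 0 = 7 money units.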

The classifier built with the 50000 data samples is then to be applied to another 50000 
data samples. There however, the column Coupon is missing and should be the result of the 
classification process. Based on the computed assignments, the profit score P is calculated 
for each contestant by the jury and the team with the highest profit will win. 

Initial Data Analysis 

The test dataset has some properties which make it especially hard for learning algorithms
to find good solutions. Figure 22.3, for example, shows three data samples with exactly the
same features but different classes. In general, there is some degree of fuzziness and noise,
and clusters belonging to different classes overlap and contain each other. Since the classes
cannot be separated by hyper-planes in a straightforward manner, the application of neural
networks and support vector machines becomes difficult. Furthermore, the features
take on only four different values and are zero in 83.7% of all cases, as illustrated in Table 22.1.
In general, such a small number of possible feature values makes it hard to apply methods
that are based on distances or averages. Stefan, Martin, and Christian had already come to
this conclusion when we met. At least one positive fact can easily be spotted when
inspecting the training data: the columns C6, C14, and C20, marked gray in Figure 22.2, are
most probably insignificant since they are almost always zero and, hence, can be excluded
from further analysis. The same goes for the first column, the customer ID, by common
sense.

Figure 22.3: DMC 2007 sample data - same features but different classes.

value    number of occurrences
0        837119
1        161936
2        924
3        21

Table 22.1: Feature values in the 2007 DMC training sets.

The Solution Approach: Classifier Systems 

From the initial data analysis, we can reduce the space of values a feature may take on to 
0, 1, and >1. This limited, discrete range is especially suited for Learning Classifier Systems 
(LCS) discussed in Chapter 7 on page 233. 

Since we already know the target function, P(C), we do not need the learning part of 
the LCS. Instead, our idea was to use the profit P(C) defined in Equation 22.1 directly as 
objective function for a genetic algorithm. 

Very much like in the Pitt approach [1926, 516, 1912] in LCS, the genetic algorithm
is based on a population of classifier systems. Such a classifier system is a list of rules
(the single classifiers). A rule contains a classification part and one condition for each feature
in the input data. We used a two-bit alphabet for the conditions, allowing us to encode the
four different conditions per feature listed in Table 22.2. The three different classes can be
represented using two additional bits, where 00 and 11 stand for A, 01 means B, and 10
corresponds to N. We omit the three insignificant features, so a rule is in total 17*2+2 = 36
bits long. This means that we need less memory for a classifier system with 17 rules than
for 10 double precision floating point numbers, as used by a neural network, for example.

condition        condition         corresponding feature value
(in genotype)    (in phenotype)
00               0                 must be 0
01               1                 must be ≥ 1
10               2                 must be > 1
11               3                 do not care (i.e., any value is ok)

Table 22.2: Feature conditions in the rules.
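
The encoding described above can be illustrated by a small decoder that turns one rule of the genotype into its phenotype. The bit layout (all conditions first, class bits last) and the function name are assumptions for this sketch; the text does not state the order of the bits within a rule.

```python
# Class encoding from the text: 00 and 11 stand for A, 01 for B, 10 for N.
CLASS_OF_BITS = {"00": "A", "11": "A", "01": "B", "10": "N"}


def decode_rule(bits, n_features=17):
    """Decode a bit string into (conditions, class label).

    Assumed layout (not stated in the text): n_features two-bit
    conditions first, followed by the two class bits.
    """
    expected = 2 * n_features + 2
    if len(bits) != expected:
        raise ValueError(f"expected {expected} bits, got {len(bits)}")
    # Each two-bit group is read as a number 0..3, the phenotypic condition.
    conditions = [int(bits[2 * i:2 * i + 2], 2) for i in range(n_features)]
    label = CLASS_OF_BITS[bits[2 * n_features:]]
    return conditions, label
```

With n_features=4, the bit string 0001101101 decodes to the conditions 0, 1, 2, 3 and the class B.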

When a data sample is to be classified, the rules of a classifier system are applied step by step.
A rule fits a given data sample if none of its conditions is violated by the corresponding
sample feature. As soon as such a rule is found, the input is assigned to the class identified
by the classification part of the rule. This stepwise interpretation creates a default hierarchy
that allows classifications to include each other: a more specific rule (which is checked before
the more general one) can represent a subset of features which is subsumed by a rule which
is evaluated later. If no rule in the classifier system fits a data sample, N is returned
by default, since misclassifying an A or B as an N at least does not introduce a penalty in
P(C) according to Equation 22.1.
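
The stepwise rule interpretation described above might look as follows in Python. This is a sketch, not the Sigoa implementation: rules are written in the phenotypic digit notation (one digit 0 to 3 per feature, followed by the class), and the reading of condition 1 as "must be at least 1" is an assumption about Table 22.2.

```python
def condition_holds(cond, value):
    """Check one phenotypic condition (0..3) against one feature value."""
    if cond == 0:
        return value == 0
    if cond == 1:
        return value >= 1  # assumed reading of condition 1
    if cond == 2:
        return value > 1
    return True  # condition 3: do not care


def parse_rule(text):
    """Parse a rule such as '120 B' into (conditions, label)."""
    digits, label = text.split()
    return [int(d) for d in digits], label


def classify(rules, features, default="N"):
    """Return the class of the first rule without violated conditions."""
    for conditions, label in rules:
        if all(condition_holds(c, v) for c, v in zip(conditions, features)):
            return label
    return default  # no rule fits: N costs nothing in Equation 22.1
```

Because the first fitting rule wins, a specific rule placed before a more general one takes precedence, which is the default hierarchy mentioned in the text.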

Since the input data is noisy, it turned out to be a good idea to introduce some fuzziness
in our classifiers, too, by modifying this default rule. During the classification process, we
remember the rule which was violated by the fewest features. In the case that no rule fits
perfectly, we check if the number of these misfits is less than one fifth of the features, in
this case 17/5 ≈ 3. If so, we consider it as a match and classify the input according to the
rule's classification part. Otherwise, the original default rule is applied and N is returned.
Figure 22.4 outlines the relation of the genotype and phenotype of such a fuzzy classifier
system. It shows a classifier system consisting of four rules that has been a real result of
the genetic algorithm. In this graphic, we also apply it to the second sample of the dataset
that is to be classified. As one can easily see, none of the four rules matches fully - which
strangely is almost always the case for classifier systems that sprang from the artificial evolution
performed by us. The data sample, however, violates only three conditions of the second rule
and, hence, stays exactly at the 17/5-threshold. Since no other rule in the classifier system has
fewer misfit conditions, the result of this classification process will be A.

Figure 22.4: An example classifier for the 2007 DMC. (Annotation in the figure: the second
rule wins since it has only 3 violated conditions, and 3 is the allowed maximum for the
default rule.)
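
The modified default rule can be sketched as follows. Again, this is an illustration under assumptions rather than the original implementation: rules are given as (digit string, class) pairs in phenotypic notation, condition 1 is read as "must be at least 1", and ties between equally bad rules are resolved in favor of the rule checked first, which the text does not specify.

```python
def count_misfits(conditions, features):
    """Count the conditions of a rule that a data sample violates."""
    misfits = 0
    for cond, value in zip(conditions, features):
        ok = (
            (cond == "0" and value == 0)
            or (cond == "1" and value >= 1)  # assumed reading of condition 1
            or (cond == "2" and value > 1)
            or cond == "3"  # do not care
        )
        if not ok:
            misfits += 1
    return misfits


def classify_fuzzy(rules, features, default="N"):
    """First perfect match wins; otherwise the least-violated rule is
    accepted if its misfits are fewer than one fifth of the features."""
    best_label, best_misfits = default, None
    for conditions, label in rules:
        misfits = count_misfits(conditions, features)
        if misfits == 0:
            return label
        if best_misfits is None or misfits < best_misfits:
            best_label, best_misfits = label, misfits
    if best_misfits is not None and best_misfits < len(features) / 5:
        return best_label
    return default
```

With 17 features, the threshold is 17/5 = 3.4, so a rule with up to three violated conditions is still accepted, as in the example of Figure 22.4.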

Analysis of the Evolutionary Process 



Table 22.3 lists the settings of the evolutionary algorithm that we have applied to evolve
classifiers for the Data-Mining-Cup 2007 problem.

Parameter                      Short  Description

Problem Space                  X      The space of classifiers consisting of between 2 and 55 rules.
                                      (see Section 22.1.2)
Objective Functions            F      F = {f1, f2}, where f1(C) = -P(C) rates the profit and
                                      f2(C) = max{len(C), 3} is the non-functional length criterion.
Search Space                   G      The variable-length bit strings with a length between 74 and 2035
                                      bits and a gene size of 37 bits. (see Section 3.5)
Search Operations              Op     cr = 70% multi-point crossover, mr = 30% mutation (including
                                      single-bit flips, insertion, and deletion of genes)
GPM                            gpm    (see Figure 22.4)
Optimization Algorithm         alg    elitist evolutionary algorithm (see Algorithm 2.2)
Comparison Operator            cm     Pareto comparison (see Section 1.2.2)
Population Size                ps     ps = 10 243
Maximum Archive Size           as     The size of the archive with the best known individuals was limited
                                      to as = 101. (see Definition 2.4)
Steady-State                   ss     The algorithm was generational (not steady-state) (ss = 0).
                                      (see Section 2.1.6)
Fitness Assignment Algorithm   fa     For fitness assignment in the evolutionary algorithm, Pareto
                                      ranking was used. (see Section 2.3.3)
Selection Algorithm            sel    A binary (k = 2) tournament selection was applied.
                                      (see Section 2.4.4)
Convergence Prevention         cp     No additional means for convergence prevention were used, i.e.,
                                      cp = 0. (see Section 2.4.8)
Generation Limit               mxt    The maximum number of generations that each run is allowed to
                                      perform: mxt = 1001. (see Definition 1.43)

Table 22.3: The settings of the experiments for the Data-Mining-Cup.




Figure 22.5: The course of the classifier system evolution. 



Figure 22.5 illustrates the course of the classifier system evolution. We can see a
logarithmic growth of the profit with the generations as well as with the number of rules in the
classifier systems. A profit of 8800 for the 50 000 data samples has been reached. Experiments
with 10 000 data samples held back and an evolution on the remaining 40 000 samples indicated
that the evolved rule sets generalize sufficiently well. The causes for the generalization of the
results are the second, non-functional objective function, which puts pressure in the direction
of smaller classifier systems, and the modified default rule, which tolerates noisy input data. The
result of the multi-objective optimization process is the Pareto-optimal set. It comprises all
solution candidates for which no other individual exists that is better in at least one
objective value and not worse in any other. Figure 22.6 displays some classifier systems which
are members of this set after generation 1000. C1 is the smallest non-dominated classifier
system. It consists of three rules which lead to a profit of 7222. C2, with one additional rule,
reaches 7403. The 31-rule classifier system C3 provides a gain of 8748, to which the system
with the highest profit evolved, C4, adds only 45 to a total of 8793 with a trade-off of 18
additional rules (49 in total).

C1

31333333011130233 B
00000000000200022 A
33111333330332130 A

C2

31333333011130233 B
01201333030333310 N
00000000000200022 A
33111333330332133 A

C3

03331333011130231 B
30111233133033133 A
31133103011313123 B
02311333332332333 A
33011103011310123 B
10300321012202233 B
10023302313300100 N
13133032333113230 A
03213300330031031 N
03020000013303113 N
13331332003110200 N
23213331131003032 A
11000330203002300 N
03300220010030331 N
33113233330032133 A
31330333011330123 B
00203301133033010 N
01201323030333330 N
30223313301003001 B
30131230133013133 A
00113010002133100 B
30033000311103200 B
11121311103310003 A
11313132101000310 B
13312102313010013 A
31100222331222302 N
01333333011130230 B
31113333100312133 A
21313101111013100 B
00000000030200022 A
33111333330331133 A

C4

02311333332332333 A
03213300330031031 N
03020000013303113 N
13331332003110200 N
13331332003110200 N
03300220010030331 N
23213331131003032 A
03300220010000331 N
21130320011021302 A
33113233330032133 A
10023122212302322 A
11000330203002300 N
30210113033032112 N
11321310200313233 A
33113233330332133 A
31330333011330123 B
30223313301003002 B
00203301133033010 N
01201323030333330 N
30223313301003001 B
30131230133013133 A
00113010002133100 B
30033000311103200 B
11313132101000310 B
13312102313010013 A
01333333011130230 B
30223313301003002 B
31113333100312133 A
21313101111013100 B
11330302002121233 B
32021231303033130 A
00000000030200022 A
31133103011313123 B
13133032333113230 A
02311333332332333 A
21313101111013100 B
10030321130311103 A
33111330330332133 A

Figure 22.6: Some Pareto-optimal individuals among the evolved classifier systems.
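
The notion of the Pareto-optimal set used above can be made concrete with a short sketch. Both objectives are minimized, as in Table 22.3 (f1 = -P(C), f2 = length criterion). The function names are assumptions, and the quadratic filter is only meant for illustration, not as the archive maintenance used in the experiments.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Applied to the objective vectors (-7222, 3), (-7403, 4), (-8748, 31), and (-8793, 49) of C1 to C4, all four systems are mutually non-dominated, whereas a hypothetical five-rule system with a profit of 7000 would be dominated by C2.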

As shown in Table 22.1 on page 377, most feature values are 0 or 1 and there are only
very few 2- and 3-valued features. In order to find out how a different treatment of those
influences the performance of the classifiers and of the evolutionary process, we slightly
modified the condition semantics in Table 22.4 by changing the meaning of condition 2 from > 1
to ≤ 1 (compare with Table 22.2 on page 378).

The progress of the evolution depicted in Figure 22.7 exhibits no significant difference to
the first one illustrated in Figure 22.5. With the modified rule semantics, the best classifier
system evolved delivered a profit of 8666 by utilizing 37 rules. This result is also not very
much different from the original version. Hence, the treatment of the features with the values
2 and 3 does not seem to have much influence on the overall result. In the first approach,
rule condition 2 used them as a distinctive criterion. The new method treats them the same
as feature value 1, with slightly worse results.

condition        condition         corresponding feature value
(in genotype)    (in phenotype)
00               0                 must be 0
01               1                 must be ≥ 1
10               2                 must be ≤ 1
11               3                 do not care (i.e., any value is ok)

Table 22.4: Different feature conditions in the rules.

Figure 22.7: The course of the modified classifier system evolution.

Contest Results and Placement 

A record number of 688 teams from 159 universities in 40 countries registered for the 2007
DMC Contest, of which only 248 were finally able to hand in results. The team of the
RWTH Aachen took first and second place by scoring 7890 and 7832 points on the contest data
set. Together with the team from the Darmstadt University of Technology, ranked third,
they occupy the first eight placements. Our team reached place 29, which is quite a good
result considering that none of its members had any prior experience in data mining.

Retrospectively, one can recognize that the winning gains are much lower than those we
have discussed in the previous experiments. They are, however, results of the classification
of a different data set: the profits in our experiments are obtained from the training sets and
not from the contest data. Although our classifiers did generalize well in the initial tests,
they seem to suffer from some degree of overfitting. Furthermore, the systems discussed
here are the result of reproduced experiments and not the original contribution from the
students. The system with the highest profit that the students handed in also had gains
around 8600 on the training sets. With a hill climbing optimizer, we squeezed out another
200, increasing, of course, the risk of additional overfitting. In the challenge, the best
submission of our team scored a total profit of 7453 (5.5% less than the winning team). This classifier
system was, however, grown with a much smaller population (4096) than in the experiments
here, due to time restrictions.

It should also be noted that we did not achieve the best result with the best single 
classifier system evolved, but with a primitive combination of this system with another 
one: If both classifier systems delivered the same result for a record, this result was used. 
Otherwise, N was returned, which at least would not lead to additional costs (as follows 
from Equation 22.1 on page 377). 
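
The primitive combination described above amounts to an agreement check between the two classifier outputs. A minimal sketch, with a hypothetical function name:

```python
def combine_by_agreement(label_1, label_2, default="N"):
    """Use a class only if both classifier systems agree on it;
    otherwise fall back to N, which never costs money in Equation 22.1."""
    return label_1 if label_1 == label_2 else default
```

The combination is conservative: a coupon is only printed when both systems recommend the same one, which trades some potential gain for fewer unredeemed coupons.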

Conclusion 

In order to solve the 2007 Data-Mining-Cup contest, we followed a structured approach.
After reviewing the data samples provided for the challenge, we adapted the idea
of classifier systems to the special needs of the competition. As a straightforward way of
obtaining such systems, we chose a genetic algorithm with two objective functions.
The first one maximized the utility of the classifiers by maximizing the profit function
provided by the contest rules. The second objective function minimized a non-functional
criterion, the number of rules in the classifiers. It was intended to restrict the amount of
overfitting. The bred classifier systems showed reasonably good generalization properties
on the test data sets separated from the original data samples, but seem to be overfitted
when comparing these results with the profits gained in the contest. One conclusion is that
it is hard to prevent overfitting in an evolution based on limited sample data - the best
classifier system obtained will possibly be overfitted. In the challenge, the combination of
two classifiers yielded the best results. Such combinations of multiple, independent systems
will probably perform better than each of them alone.

In further projects, especially the last two conclusions drawn should be considered. Al- 
though we used a very simple way to combine our classifier systems for the contest, it still 
provided an advantage. 

A classifier system is in principle nothing more than an estimator 8. There exist many
sophisticated methods of combining different estimators in order to achieve better results
[88]. The original version of such "boosting algorithms", developed by Schapire [1825],
theoretically allows achieving an arbitrarily low error rate, requiring basic estimators with a
performance only slightly better than random guessing on any input distribution. The
AdaBoost algorithm by Freund and Schapire [746, 747] additionally takes into consideration the
error rates of the estimators. With this approach, even classifiers of different architectures
like a neural network and a Learning Classifier System can be combined. Since the
classification task in the challenge required non-fuzzy answers in the form of definite set memberships,
the usage of weighted majority voting [745, 1826], as already applied in a very primitive
manner, would probably have been the best approach.
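
To illustrate the idea of weighted majority voting, the following sketch sums a weight per class over the votes of several classifiers and returns the class with the largest total. It is a generic illustration, not the AdaBoost weighting scheme itself; the function name and the way the weights are obtained (for instance, from estimated error rates) are assumptions.

```python
def weighted_majority_vote(votes, weights):
    """Return the class label with the largest total weight.

    votes:   class labels, one per classifier
    weights: non-negative weights, one per classifier (assumed given)
    Ties are resolved in favor of the label that was voted for first.
    """
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)
```

With equal weights, this reduces to a plain majority vote; with unequal weights, a single reliable classifier can outvote several weak ones.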

22.2 The Web Service Challenge 

22.2.1 Introduction 

Web Service Composition 

The necessity for fast service composition systems and the overall idea of the WS-Challenge
are directly connected with the emergence of Service-Oriented Architectures (SOA).

Today, companies rely on IT-architectures which are as flexible as their business strategy.
The software of an enterprise must be able to adapt to changes in the business processes
like accounting, billing, the workflows, and even in the office software. If external vendors,
suppliers, or customers change, the interfaces to their IT systems must be newly created
or modified too. Hence, the architecture of corporate software has to be built with the
anticipation of changes and updates [796, 1026, 2199].

8 See our discussion on estimation theory in Section 28.7 on page 499.

A SOA is the ideal architecture for such systems [1362, 635]. Service oriented architectures 
allow us to modularize the business logic and to implement it in the form of services accessible 
in a network. Services are building blocks for service processes which represent the workflows 
of an enterprise. They can be added, removed, and updated at runtime without interfering 
with the ongoing business. A SOA can be seen as a complex system with manifold services 
as well as n:m dependencies between services and applications: 

1. An application may need various service functionalities. 

2. Different applications may need the same service functionality. 

3. A certain functionality may be provided by multiple services. 

Business now depends on the availability of service functionality, which is ensured by 
service management. Manual service management however becomes more and more cum- 
bersome and ineffective with a rising number of relations between services and applications. 
Here, self-organization promises a solution for finding services that offer a specific function- 
ality automatically. 

Self-organizing approaches need a combination of syntactic and semantic service descrip- 
tions in order to decide whether a service provides a wanted functionality or not. Common 
syntactic definitions like WSDL [249] specify the order and types of service parameters 
and return values. Semantic interface description languages like OWL-S [71] or WSMO 
[1748, 1749] annotate these parameters with a meaning. While WSDL can be used to define 
a parameter myisbn of the type String, with OWL-S we can define that myisbn expects a 
String which actually contains an ISBN. Via a taxonomy we can now deduce that values 
which are annotated as either ISBN-10 or ISBN-13 9 can be passed to this service. 

A wanted functionality is defined by a set of required output and available input pa- 
rameters. A service offers this functionality if it can be executed with these available input 
parameters and its return values contain the needed output values. In order to find such 
services, the semantic concepts of their parameters are matched rather than their syntactic 
data types. 

Many service management approaches employ semantic service discovery [223, 224, 222,
1314, 1748, 1749, 830, 831]. Still, there is a substantial lack of research on algorithms and
system design for fast-response service discovery. This is especially the case in service
composition, where service functionality is not necessarily provided by a single service. Instead,
combinations of services (compositions) are discovered. The sequential execution of these
services provides the requested functionality.

The Web Service Challenge 

Since 2005, the annual Web Service Challenge 10 (WS-Challenge, WSC) has provided a platform
for researchers in the area of web service composition to compare their systems and exchange
experiences [212, 213, 214]. It is co-located with the IEEE Conference on Electronic
Commerce (CEC) and the IEEE International Conference on e-Technology, e-Commerce,
and e-Service (EEE).

Each team participating in this challenge provides one software system. A jury then uses 
these systems to solve different, complicated web service discovery and composition tasks. 
The major evaluation criterion for the composers is the speed with which the problems are 
solved. Another criterion is the completeness of the solution. Additionally, there is also a 
prize for the best overall system architecture. 

9 There are two formats for International Standard Book Numbers (ISBNs), ISBN-10 and ISBN-13,
see also http://en.wikipedia.org/wiki/Isbn [accessed 2007-09-02].

10 See http://www.ws-challenge.org/ [accessed 2007-09-02].

Figure 22.8: The logo of the Web Service Challenge.



22.2.2 The 2006/2007 Semantic Challenge 

We have participated in the 2006 and 2007 Web Service Challenges [225, 226]. Here we 
present the system, algorithms and data structures for semantic web service composition 
that we applied in both challenges. A slightly more thorough discussion of this topic can 
be found in Weise et al. [2179, 2184]. The tasks of the 2006 Web Service Challenge in San 
Francisco, USA and the 2007 WSC in Tokyo, Japan are quite similar and only deviate in 
the way in which the solutions have to be provided by the software systems. Hence, we will 
discuss the two challenges together in this single section. Furthermore, we only consider the 
semantic challenges, since they are more demanding than mere syntactic matching. 

Semantic Service Composition 

In order to discuss the idea of semantic service composition properly, we need some prerequisites.
Therefore, let us initially define the set of all semantic concepts 𝕄. All concepts that
exist in the knowledge base are members of 𝕄 and can be represented as nodes in a forest
of taxonomy trees.

Definition 22.2 (subsumes). Two concepts A, B ∈ 𝕄 can be related in one of four possible
ways. We define the predicate subsumes : 𝕄 × 𝕄 ↦ 𝔹 to express this relation as follows:

1. subsumes(A, B) holds if and only if A is a generalization of B (B is then a specialization
of A).

2. subsumes(B, A) holds if and only if A is a specialization of B (B is then a generalization
of A).

3. If neither subsumes(A, B) nor subsumes(B, A) holds, A and B are not related to each
other.

4. subsumes(A, B) and subsumes(B, A) both hold if and only if A = B (antisymmetry, as
defined in Equation 27.59 on page 463).

The subsumes relation is transitive (see Equation 27.55 on page 462), and so are generalization
and specialization: If A is a generalization of B (subsumes(A, B)) and B is a generalization
of C (subsumes(B, C)), then A is also a generalization of C (subsumes(A, C)).
The same goes vice versa for specialization. Here we can define that if A is a specialization
of B (subsumes(B, A)) and A is also a specialization of C (subsumes(C, A)), then either
subsumes(B, C) or subsumes(C, B) (or both) must hold, i.e., either C is a specialization of
B, or B is a specialization of C, or B = C.

If a parameter x of a service is annotated with A and a value y annotated with B is
available, we can set x = y and call the service only if subsumes(A, B) holds (contravariance).
This means that x expects less or equal information than given in y. The hierarchy defined
here is pretty much the same as in object-oriented programming languages. If we imagine
A and B to be classes in Java, subsumes(A, B) can be considered to be equivalent to the
expression A.class.isAssignableFrom(B.class). If it evaluates to true, a value y of type B
can be assigned to a variable x of type A since y instanceof A will also be true.
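
The analogy with Java classes can be tried out directly. The following sketch uses three hypothetical concept classes (Isbn, Isbn10, Isbn13, named after the ISBN example used earlier); they are illustrative only and not part of the composition system.

```java
// Hypothetical concept classes: Isbn10 and Isbn13 specialize Isbn.
class Isbn {}
class Isbn10 extends Isbn {}
class Isbn13 extends Isbn {}

final class SubsumesDemo {
    /** subsumes(A, B): true if A equals B or A is a generalization of B. */
    static boolean subsumes(Class<?> a, Class<?> b) {
        return a.isAssignableFrom(b);
    }

    public static void main(String[] args) {
        // A parameter expecting any ISBN accepts an ISBN-10 value.
        System.out.println(subsumes(Isbn.class, Isbn10.class));   // true
        // A parameter expecting an ISBN-10 does not accept an arbitrary ISBN.
        System.out.println(subsumes(Isbn10.class, Isbn.class));   // false
        // Sibling concepts are not related to each other.
        System.out.println(subsumes(Isbn10.class, Isbn13.class)); // false
    }
}
```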

From the viewpoint of a composition algorithm, there is no need for a distinction between
parameters and the annotated concepts. The set 𝕊 contains all the services s known to the
service registry. Each service s ∈ 𝕊 has a set of required input concepts s.in ⊆ 𝕄 and a set
of output concepts s.out ⊆ 𝕄 which it will deliver on return. We can trigger a service if we
can provide all of its input parameters.

Similarly, a composition request R always consists of a set of available input concepts
R.in ⊆ 𝕄 and a set of requested output concepts R.out ⊆ 𝕄. A composition algorithm in
the sense of the Web Service Challenges 2006 and 2007 discovers a (topologically sorted 11 )
set of n services S = {s1, s2, ..., sn} : s1, ..., sn ∈ 𝕊. As shown in Equation 22.2, the first
service (s1) of a valid composition can be executed with instances of the input concepts
R.in. Together with R.in, its outputs (s1.out) are available for executing the next service
(s2) in S, and so on. The composition provides outputs that are either annotated with
exactly the requested concepts R.out or with more specific ones (covariance). Assuming
that R.in ∩ R.out = ∅, for each composition solving the request R, the predicate isGoal(S)
will hold. With Equation 22.2, we have defined the goal predicate which we can use in any
form of informed or uninformed state space search (see Chapter 17 on page 289).

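\[
\begin{aligned}
\text{isGoal}(S) \Leftrightarrow {} & \forall A \in s_1.\text{in}\ \exists B \in R.\text{in} : \text{subsumes}(A, B)\ \wedge \\
& \forall A \in s_i.\text{in},\ i \in \{2..n\}\ \exists B \in R.\text{in} \cup s_{i-1}.\text{out} \cup \dots \cup s_1.\text{out} : \text{subsumes}(A, B)\ \wedge \\
& \forall A \in R.\text{out}\ \exists B \in s_1.\text{out} \cup \dots \cup s_n.\text{out} : \text{subsumes}(A, B)
\end{aligned}
\tag{22.2}
\]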
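
A minimal Java sketch of the goal predicate from Equation 22.2 may make the three conditions more concrete. It assumes that concepts are named by strings and that the taxonomy is stored as a child-to-parent map; the classes GoalCheck, Taxonomy, and Service are hypothetical and not taken from the challenge software.

```java
import java.util.*;

final class GoalCheck {
    /** Concept -> its direct generalization; root concepts have no entry. */
    static final class Taxonomy {
        private final Map<String, String> parent = new HashMap<>();

        void add(String concept, String generalization) { parent.put(concept, generalization); }

        /** subsumes(a, b): a equals b or a is a generalization of b. */
        boolean subsumes(String a, String b) {
            for (String c = b; c != null; c = parent.get(c)) {
                if (c.equals(a)) return true;
            }
            return false;
        }
    }

    static final class Service {
        final Set<String> in, out;
        Service(Set<String> in, Set<String> out) { this.in = in; this.out = out; }
    }

    /** True if some available concept b fulfils the demand a, i.e. subsumes(a, b). */
    static boolean satisfied(Taxonomy t, String a, Set<String> available) {
        for (String b : available) {
            if (t.subsumes(a, b)) return true;
        }
        return false;
    }

    /** isGoal(S) for the request (rIn, rOut) and the ordered composition s1..sn. */
    static boolean isGoal(Taxonomy t, Set<String> rIn, Set<String> rOut, List<Service> comp) {
        Set<String> available = new HashSet<>(rIn);   // R.in plus outputs of earlier services
        Set<String> produced = new HashSet<>();       // outputs of all services only
        for (Service s : comp) {
            for (String a : s.in) {
                if (!satisfied(t, a, available)) return false;
            }
            available.addAll(s.out);
            produced.addAll(s.out);
        }
        for (String a : rOut) {
            if (!satisfied(t, a, produced)) return false;
        }
        return true;
    }
}
```

The order of the services matters: a service whose inputs are only produced by a later service makes the check fail.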

The Problem Definition 

In the 2006 and 2007 Web Service Challenge, the composition software is provided with 
three parameters: 

1. A concept taxonomy to be loaded into the knowledge base of the system. This taxonomy 
was stored in a file of the XML Schema format [641]. 

2. A directory containing the specifications of the services to be loaded into the service
registry. For each service, there was a single file given in WSDL format [249].

3. A query file containing multiple service composition requests R1, R2, ... in a made-up
XML [284] format.

These formats are very common and allow the contestants to apply the solutions in
real-world applications later as well as to use customized versions of their already existing
applications. The expected result to be returned by the software was also a stream of data
in a proprietary XML dialect containing all possible service compositions that solved the
queries according to Equation 22.2. It was possible that a request Ri was resolved by multiple
service compositions. In the 2006 challenge, the communication between the jury and the
programs took place via the command line or other interfaces provided by the software; in
2007, a web service interface was obligatory.

11 The set S is only partially ordered since, in principle, some services may be executed in parallel
if they do not depend on each other.

We will not discuss the data formats used in this challenge any further since they are
replaceable and do not contribute to the way the composition queries are solved. Remarkable,
however, were the following restrictions in the challenge tasks:

1. There exists at least one solution for each query. 

2. The services in the solutions are represented as a sequence of sets. Each set contains 
equivalent services. Executing one service from each set forms a valid composition S. 
This representation does not allow for any notation of parallelization. 

Before we elaborate on the solution itself, let us define the operation "promising" which
obtains the set of all services s ∈ 𝕊 that produce an output parameter annotated with the
concept A (regardless of their inputs).

\[ \forall A \in \mathbb{M},\ \forall s \in \text{promising}(A) \Rightarrow \exists B \in s.\text{out} : \text{subsumes}(A, B) \tag{22.3} \]

The composition system that we have applied in the 2007 WSC consists of three types
of composition algorithms. The problem space X that they investigate is basically the set
of all possible permutations of all possible sets of services. The power set P(𝕊) includes all
possible subsets of 𝕊. X is the set of all possible permutations of the elements in such subsets,
in other words X ⊆ {∀ permutation(ξ) : ξ ∈ P(𝕊)}.

An (Uninformed) Algorithm Based on IDDFS 

The most general and straightforward approach to web service composition is the uninformed
search, here an iterative deepening depth-first search (IDDFS) algorithm as discussed
in Section 17.3.4 on page 294. Uninformed search algorithms do not make use of any
information other than goal predicates as defined in Equation 22.2. We can build such a
composition algorithm based on iterative deepening depth-first search. It finds solutions
quickly only for small service repositories but is optimal if the problem requires an exhaustive
search. Thus, it may be used by the strategic planner in conjunction with another algorithm
that runs in parallel if the size of the repository is reasonably small or if it is unclear whether
the problem can actually be solved.

Algorithm 22.1 (webServiceCompositionIDDFS) builds a valid web service composition
starting from the back. In each recursion, its internal helper method dl_dfs_wsc tests all
elements A of the set wanted of yet unknown parameters. It then iterates over the set of all
services s that can provide A. For every single s, wanted is recomputed. If it becomes the
empty set ∅, we have found a valid composition and can return it. If dl_dfs_wsc is not able
to find a solution within the maximum depth limit (which denotes the maximum number
of services in the composition), it returns ∅. The loop in Algorithm 22.1 iteratively invokes
dl_dfs_wsc by increasing the depth limit step by step, until a valid solution is found.
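
The following Java sketch follows the structure of Algorithm 22.1 under a few assumptions: concepts are strings, the taxonomy is a child-to-parent map, and the class and method names (IddfsComposer, compose, dlDfs) are made up for this illustration. Unlike the pseudocode, it starts with a depth limit of 1 and stops at a caller-supplied cap so that unsolvable requests terminate.

```java
import java.util.*;

final class IddfsComposer {
    static final class Service {
        final String name; final Set<String> in, out;
        Service(String name, Set<String> in, Set<String> out) { this.name = name; this.in = in; this.out = out; }
    }

    final Map<String, String> parent = new HashMap<>(); // concept -> direct generalization
    final List<Service> registry = new ArrayList<>();

    boolean subsumes(String a, String b) {
        for (String c = b; c != null; c = parent.get(c)) if (c.equals(a)) return true;
        return false;
    }

    /** True if some concept in available is equal to or more specific than a. */
    boolean covered(String a, Set<String> available) {
        for (String b : available) if (subsumes(a, b)) return true;
        return false;
    }

    /** promising(a): all services with an output that fulfils the demand a. */
    List<Service> promising(String a) {
        List<Service> result = new ArrayList<>();
        for (Service s : registry) if (covered(a, s.out)) result.add(s);
        return result;
    }

    /** Iterative deepening: raise the depth limit until a composition is found. */
    List<Service> compose(Set<String> rIn, Set<String> rOut, int depthCap) {
        for (int maxDepth = 1; maxDepth <= depthCap; maxDepth++) {
            List<Service> found = dlDfs(rIn, rOut, new ArrayList<Service>(), 1, maxDepth);
            if (found != null) return found;
        }
        return null; // nothing found with at most depthCap services
    }

    private List<Service> dlDfs(Set<String> in, Set<String> out, List<Service> composition,
                                int depth, int maxDepth) {
        for (String a : out) {
            for (Service s : promising(a)) {
                Set<String> wanted = new HashSet<>();
                for (String b : out) if (!covered(b, s.out)) wanted.add(b); // still open demands
                for (String d : s.in) if (!covered(d, in)) wanted.add(d);   // new demands of s
                List<Service> comp = new ArrayList<>();
                comp.add(s);                 // the composition is built from the back,
                comp.addAll(composition);    // so s is placed in front
                if (wanted.isEmpty()) return comp;
                if (depth < maxDepth) {
                    List<Service> deeper = dlDfs(in, wanted, comp, depth + 1, maxDepth);
                    if (deeper != null) return deeper;
                }
            }
        }
        return null;
    }
}
```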

An (Informed) Heuristic Approach 

The IDDFS-algorithm just discussed performs an uninformed search in the space of possible 
service compositions. As we know from Section 17.4 on page 295, we can increase the search 
speed by defining good heuristics and using domain information. Such information can easily 
be derived in this research area. Therefore, we will again need some further definitions. 
Notice that the set functions specified in the following do not need to be evaluated every
time they are queried, since we can maintain their information as meta-data along with the
composition and thus save runtime.

Let us first define the set of unsatisfied parameters wanted(S) ⊆ 𝕄 in a candidate
composition S as

\[
\begin{aligned}
\text{wanted}(S) = {} & \{A : s_i \in S \wedge A \in s_i.\text{in} \wedge A \notin R.\text{in} \wedge (\nexists s_j \in S : 0 < j < i \wedge A \in s_j.\text{out})\}\ \cup \\
& R.\text{out} \setminus \Big( R.\text{in} \cup \bigcup_{\forall s \in S} s.\text{out} \Big)
\end{aligned}
\tag{22.4}
\]

Algorithm 22.1: S ← webServiceCompositionIDDFS(R)

Input: R: the composition request
Data: maxDepth, depth: the maximum and the current search depth
Data: in, out: current parameter sets
Output: S: a valid service composition solving R

begin
    maxDepth ← 2
    repeat
        S ← dl_dfs_wsc(R.in, R.out, ∅, 1)
        maxDepth ← maxDepth + 1
    until S ≠ ∅

    Subalgorithm S ← dl_dfs_wsc(in, out, composition, depth)
    begin
        foreach A ∈ out do
            foreach s ∈ promising(A) do
                wanted ← out
                foreach B ∈ wanted do
                    if ∃C ∈ s.out : subsumes(B, C) then wanted ← wanted \ {B}
                foreach D ∈ s.in do
                    if ∄E ∈ in : subsumes(D, E) then wanted ← wanted ∪ {D}
                comp ← s ⊕ composition
                if wanted = ∅ then
                    return comp
                else
                    if depth < maxDepth then
                        comp ← dl_dfs_wsc(in, wanted, comp, depth + 1)
                        if comp ≠ ∅ then return comp
        return ∅
    end
end



In other words, a wanted parameter is either an output concept of the composition query
or an input concept of any of the services in the composition candidate that has been
satisfied neither by an input parameter of the query nor by an output parameter of any
service. Here we assume that the concept A wanted by service s is not also an output
parameter of s. This is done for simplification purposes - the implementation has to keep
track of this possibility.

The set of eliminated parameters of a service composition contains all input parameters
of the services of the composition and queried output parameters of the composition request
that already have been satisfied.

\[ \text{eliminated}(S) = \Big( R.\text{out} \cup \bigcup_{\forall s \in S} s.\text{in} \Big) \setminus \text{wanted}(S) \tag{22.5} \]

Finally, the set of known concepts is the union of the input parameters defined in the
composition request and the output parameters of all services in the composition candidate.

\[ \text{known}(S) = R.\text{in} \cup \bigcup_{\forall s \in S} s.\text{out} \tag{22.6} \]
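
The three sets can be computed in a single pass over a candidate composition. The sketch below is a simplified Java illustration: it matches concepts by equality, exactly as Equations 22.4 to 22.6 are written, whereas a full implementation would use the subsumes relation. The class name CompositionSets and its Service type are hypothetical.

```java
import java.util.*;

final class CompositionSets {
    static final class Service {
        final Set<String> in, out;
        Service(Set<String> in, Set<String> out) { this.in = in; this.out = out; }
    }

    /** known(S) = R.in united with the outputs of all services in S (Equation 22.6). */
    static Set<String> known(Set<String> rIn, List<Service> comp) {
        Set<String> k = new HashSet<>(rIn);
        for (Service s : comp) k.addAll(s.out);
        return k;
    }

    /** wanted(S): unsatisfied service inputs plus unsatisfied requested outputs (Equation 22.4). */
    static Set<String> wanted(Set<String> rIn, Set<String> rOut, List<Service> comp) {
        Set<String> w = new HashSet<>();
        Set<String> available = new HashSet<>(rIn);   // R.in plus outputs of earlier services
        for (Service s : comp) {
            for (String a : s.in) if (!available.contains(a)) w.add(a);
            available.addAll(s.out);
        }
        for (String a : rOut) if (!available.contains(a)) w.add(a);
        return w;
    }

    /** eliminated(S) = (R.out united with all service inputs) minus wanted(S) (Equation 22.5). */
    static Set<String> eliminated(Set<String> rIn, Set<String> rOut, List<Service> comp) {
        Set<String> e = new HashSet<>(rOut);
        for (Service s : comp) e.addAll(s.in);
        e.removeAll(wanted(rIn, rOut, comp));
        return e;
    }
}
```

In a real composer these sets would be maintained incrementally whenever a service is added instead of being recomputed for every query.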

Instead of using these sets to build a heuristic function h, we can derive a comparator
function cmp_wsc directly. This comparator function has the advantage that we can also apply
randomized optimization methods like evolutionary algorithms based on it.

Algorithm 22.2: r ← cmp_wsc(S1, S2)

Input: S1, S2: two composition candidates
Output: r ∈ ℤ: indicating whether S1 (r < 0) or S2 (r > 0) should be expanded next

begin
    i1 ← len(wanted(S1))
    i2 ← len(wanted(S2))
    if i1 = 0 then
        if i2 = 0 then return len(S1) − len(S2)
        else return −1
    if i2 = 0 then return 1
    e1 ← len(eliminated(S1))
    e2 ← len(eliminated(S2))
    if e1 > e2 then return 1
    else if e1 < e2 then return −1
    if i1 < i2 then return −1
    else if i1 > i2 then return 1
    if len(S1) ≠ len(S2) then return len(S1) − len(S2)
    return len(known(S1)) − len(known(S2))
end



Algorithm 22.2 defines cmp_wsc which compares two composition candidates S1 and S2.
This function can be used by a greedy search algorithm in order to decide which of the two
possible solutions is more prospective. cmp_wsc will return a negative value if S1 seems to be
closer to a solution than S2, a positive value if S2 looks as if it should be examined before
S1, and zero if both seem to be equally good.

First, it compares the number of wanted parameters. If a composition has no such unsatisfied
concepts, it is a valid solution. If both S1 and S2 are valid, the solution involving
fewer services wins. If only one of them is complete, it also wins. Otherwise, both candidates
still have unsatisfied concepts, and the numbers of already satisfied parameters are compared.
Only if both of them have the same number of satisfied parameters do we again compare the
wanted concepts. We were surprised to find that the search algorithms perform significantly
faster when the number of already satisfied concepts is used as a comparison criterion with a
higher priority than the number of remaining unsatisfied concepts. If the numbers of wanted
concepts are also equal, we prefer the shorter composition
candidate. If even the compositions are of the same length, we finally base the decision on
the total number of known concepts. The interesting form of this comparator function is
maybe caused by the special requirements of the WSC data. Nevertheless, it shows which
sorts of information about a composition can be incorporated into the search.

Using the comparator function cmp_wsc, we can customize the greedy search approach
defined in Algorithm 17.6 on page 296 for web service composition. The function
greedyComposition defined in Algorithm 22.3 performs such a greedy composition by maintaining
an internal list which is sorted in descending order according to cmp_wsc. In each iteration,
the last element is popped from the list and either returned (if it is a valid composition) or
expanded by appending services providing wanted concepts.
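
The greedy scheme can be sketched in Java with a priority queue taking the place of the sorted list. This is an illustration under simplifying assumptions, not the contest implementation: concepts are matched by equality instead of subsumes, the ordering is a reduced stand-in for cmp_wsc (fewer wanted concepts first, then shorter candidates), and an expansion limit is added so that unsolvable requests terminate. All names (GreedyComposer, compose, maxExpansions) are hypothetical.

```java
import java.util.*;

final class GreedyComposer {
    static final class Service {
        final String name; final Set<String> in, out;
        Service(String name, Set<String> in, Set<String> out) { this.name = name; this.in = in; this.out = out; }
    }

    final List<Service> registry = new ArrayList<>();
    final Set<String> rIn, rOut;

    GreedyComposer(Set<String> rIn, Set<String> rOut) { this.rIn = rIn; this.rOut = rOut; }

    /** Concepts that are still unsatisfied in the candidate comp. */
    Set<String> wanted(List<Service> comp) {
        Set<String> w = new HashSet<>();
        Set<String> available = new HashSet<>(rIn);
        for (Service s : comp) {
            for (String a : s.in) if (!available.contains(a)) w.add(a);
            available.addAll(s.out);
        }
        for (String a : rOut) if (!available.contains(a)) w.add(a);
        return w;
    }

    List<Service> promising(String a) {
        List<Service> result = new ArrayList<>();
        for (Service s : registry) if (s.out.contains(a)) result.add(s);
        return result;
    }

    /** Simplified stand-in for cmp_wsc: negative if a should be examined before b. */
    int compare(List<Service> a, List<Service> b) {
        int c = Integer.compare(wanted(a).size(), wanted(b).size());
        return c != 0 ? c : Integer.compare(a.size(), b.size());
    }

    List<Service> compose(int maxExpansions) {
        Comparator<List<Service>> order = this::compare;
        PriorityQueue<List<Service>> open = new PriorityQueue<>(order);
        for (String a : rOut) {
            for (Service s : promising(a)) open.add(Collections.singletonList(s));
        }
        for (int i = 0; i < maxExpansions && !open.isEmpty(); i++) {
            List<Service> best = open.poll();          // most promising candidate
            Set<String> missing = wanted(best);
            if (missing.isEmpty()) return best;        // nothing wanted: valid composition
            for (String a : missing) {
                for (Service s : promising(a)) {
                    List<Service> next = new ArrayList<>(best.size() + 1);
                    next.add(s);                       // the provider runs before its consumers
                    next.addAll(best);
                    open.add(next);
                }
            }
        }
        return null;
    }
}
```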

An Evolutionary Approach 

In order to use an evolutionary algorithm to breed web service compositions, we first need
to define a proper genome G able to represent service sequences. A straightforward yet
efficient way is to use (variable-length) strings of service identifiers which can be processed
by standard genetic algorithms (see Section 3.5 on page 149). A service can be identified by
a number from ℕ0 denoting its index in the list of all services in the registry. The
genotype-phenotype mapping transforming the genotypes g ∈ G, which are sequences of such identifiers,
to sequences of services, i.e., the phenotypes S ∈ X, is thus trivial.

Algorithm 22.3: S ← greedyComposition(R)

Input: R: the composition request
Data: X: the descendingly sorted list of compositions to explore
Output: S: the solution composition found, or ∅

begin
    X ← ⋃ promising(A) over all A ∈ R.out
    while X ≠ ∅ do
        X ← sortList_d(X, cmp_wsc)
        S ← X[len(X) − 1]
        X ← deleteListItem(X, len(X) − 1)
        if isGoal(S) then
            return S
        foreach A ∈ wanted(S) do
            foreach s ∈ promising(A) do
                X ← addListItem(X, s ⊕ S)
    return ∅
end



Because of the well-known string form, we could apply the standard creation, mutation,
and crossover operators. However, by specifying a specialized mutation operation, we can
make the search more efficient. This new operation either deletes the first service in S (via
mutate_wsc1) or adds a promising service to S (as done in mutate_wsc2). Using the adjustable
variable σ as a threshold, we can tell the search whether it should prefer growing or shrinking
the solution candidates.

\[ \text{mutate}_{wsc2}(S) = s \oplus S : A \in \text{wanted}(S) \wedge s \in \text{promising}(A) \tag{22.8} \]

\[
\text{mutate}_{wsc}(S) \equiv
\begin{cases}
\text{mutate}_{wsc1}(S) & \text{if } \text{random}_u() > \sigma \\
\text{mutate}_{wsc2}(S) & \text{otherwise}
\end{cases}
\tag{22.9}
\]

A new create operation for building the initial random configurations can be defined as
a sequence of mutate_wsc2 invocations of random length. Initially, mutate_wsc2(∅) will return
a composition consisting of a single service that satisfies at least one parameter in R.out.
We iteratively apply mutate_wsc2 to its previous result a random number of times in order to
create a new individual.
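
A hedged Java sketch of these operators is given here. It assumes equality matching of concepts, treats an empty composition as unchanged by the deletion, and leaves a composition unchanged if nothing is wanted or no provider exists; these boundary cases and all names (WscMutation, deleteFirst, addPromising, create) are assumptions of the sketch.

```java
import java.util.*;

final class WscMutation {
    static final class Service {
        final String name; final Set<String> in, out;
        Service(String name, Set<String> in, Set<String> out) { this.name = name; this.in = in; this.out = out; }
    }

    final List<Service> registry = new ArrayList<>();
    final Set<String> rIn, rOut;
    final double sigma;   // threshold: larger values favor growing the composition
    final Random random;

    WscMutation(Set<String> rIn, Set<String> rOut, double sigma, Random random) {
        this.rIn = rIn; this.rOut = rOut; this.sigma = sigma; this.random = random;
    }

    Set<String> wanted(List<Service> comp) {
        Set<String> w = new HashSet<>();
        Set<String> available = new HashSet<>(rIn);
        for (Service s : comp) {
            for (String a : s.in) if (!available.contains(a)) w.add(a);
            available.addAll(s.out);
        }
        for (String a : rOut) if (!available.contains(a)) w.add(a);
        return w;
    }

    /** mutate_wsc1: delete the first service of the composition. */
    List<Service> deleteFirst(List<Service> comp) {
        if (comp.isEmpty()) return comp;
        return new ArrayList<Service>(comp.subList(1, comp.size()));
    }

    /** mutate_wsc2: put a random provider of a random wanted concept in front. */
    List<Service> addPromising(List<Service> comp) {
        List<String> missing = new ArrayList<>(wanted(comp));
        if (missing.isEmpty()) return comp;
        String a = missing.get(random.nextInt(missing.size()));
        List<Service> providers = new ArrayList<>();
        for (Service s : registry) if (s.out.contains(a)) providers.add(s);
        if (providers.isEmpty()) return comp;
        List<Service> grown = new ArrayList<>();
        grown.add(providers.get(random.nextInt(providers.size())));
        grown.addAll(comp);
        return grown;
    }

    /** mutate_wsc: shrink if the random number exceeds sigma, grow otherwise. */
    List<Service> mutate(List<Service> comp) {
        return random.nextDouble() > sigma ? deleteFirst(comp) : addPromising(comp);
    }

    /** create: apply mutate_wsc2 between 1 and maxSteps times to the empty composition. */
    List<Service> create(int maxSteps) {
        List<Service> comp = new ArrayList<>();
        int steps = 1 + random.nextInt(maxSteps);
        for (int i = 0; i < steps; i++) comp = addPromising(comp);
        return comp;
    }
}
```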

The Comparator Function and Pareto Optimization 

As driving force for the evolutionary process we can reuse the comparator function cmp_wsc
as specified for the greedy search in Algorithm 22.2 on the preceding page. It combines
multiple objectives, putting pressure towards the direction of 

1. compositions which are complete, 

2. small compositions, 

3. compositions that resolve many unknown parameters, and 

4. compositions that provide many parameters. 

On the other hand, we could as well separate these single aspects into different objective 
functions and apply direct Pareto optimization. This has the drawback that it spreads the 
pressure of the optimization process over the complete Pareto frontier 12 . 

12 See Section 1.2.2 on page 33 for a detailed discussion on the drawbacks of pure Pareto
optimization.

Figure 22.9: A sketch of the Pareto front in the genetic composition algorithm.



Figure 22.9 visualizes the multi-objective version of the optimization problem "web ser- 
vice composition" by sketching a characteristic example for Pareto frontiers of several gen- 
erations of an evolutionary algorithm. We concentrate on the two dimensions composition 
size and number of wanted (unsatisfied) parameters. Obviously, we need to find compositions 
which are correct, i. e., where the latter objective is zero. On the other hand, an evolution 
guided only by this objective can (and will) produce compositions containing additional, 
useless invocations of services not related to the problem at all. The size objective is thus 
also required. 

In the manufactured example depicted in Figure 22.9, the first five or so generations are 
not able to produce good compositions yet. We just can observe that longer compositions 
tend to provide more parameters (and have thus a lower number of wanted parameters). 
In generation 20, the Pareto frontier is pushed farther forward and touches the abscissa 
- the first correct solution is found. In the generations to come, this solution is improved 
and useless service calls are successively removed, so the composition size decreases. There 
will be a limit, illustrated as generation 50, where the shortest compositions for all possible 
values of wanted are found. From now on, the Pareto front cannot progress any further and 
the optimization process has come to a rest. 

As you can see, pure Pareto optimization does not only seek the best correct solution
but also looks for the best possible composition consisting of only one service, for the best
one with two services, with three services, and so on. This spreading of the population of
course slows down the progress into the specific direction where wanted(S) decreases.
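
The spreading effect can be reproduced with a small dominance check over the two objectives composition size and number of wanted parameters, both to be minimized. The following Java sketch is illustrative only; the class ParetoFront and its Candidate type are assumptions.

```java
import java.util.*;

final class ParetoFront {
    /** A candidate described by the two minimized objectives. */
    static final class Candidate {
        final int size, wanted;
        Candidate(int size, int wanted) { this.size = size; this.wanted = wanted; }
    }

    /** a dominates b if it is no worse in both objectives and strictly better in one. */
    static boolean dominates(Candidate a, Candidate b) {
        return a.size <= b.size && a.wanted <= b.wanted
            && (a.size < b.size || a.wanted < b.wanted);
    }

    /** Returns all candidates that are not dominated by any other candidate. */
    static List<Candidate> nonDominated(List<Candidate> population) {
        List<Candidate> front = new ArrayList<>();
        for (Candidate c : population) {
            boolean dominated = false;
            for (Candidate other : population) {
                if (dominates(other, c)) { dominated = true; break; }
            }
            if (!dominated) front.add(c);
        }
        return front;
    }
}
```

For a population with the (size, wanted) pairs (1, 3), (2, 1), (3, 0), (4, 0), and (2, 2), the front consists of (1, 3), (2, 1), and (3, 0). Only the last one is a correct composition, yet the two incomplete candidates are kept as equally valuable.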

The comparator function cmp_wsc has proven to be more efficient in focusing the evolution on
this part of the problem space. The genetic algorithm based on it is superior in performance
and, hence, is used in our experiments.

Experimental Results 

In Table 22.5, we illustrate the times that the algorithms introduced in this section needed
to perform composition tasks of different complexity 13 . We have repeated the experiments
multiple times on an off-the-shelf PC 14 and noted the mean values. The times themselves
are less important than the proportions and relations between them.

13 The test sets used here are available at
http://www.it-weise.de/documents/files/BWG2007WSC_software.zip [accessed 2009-06-26].
Well, at least partly, I've accidentally deleted set 12 and 13. Sorry.

14 2GHz, Pentium IV single core with Hyper-Threading, 1GiB RAM, Windows XP, Java 1.6.0_03-b05

Table 22.5: Experimental results for the web service composers.



The IDDFS approach can only solve smaller problems and becomes infeasible very fast.
When building simpler compositions though, it is about as fast as the heuristic approach,
which was clearly dominating in all categories. A heuristic may be misleading and (although
this didn't happen in our experiments) could lead to a very long computation time in the
worst case. Furthermore, if a problem cannot be solved, the heuristic cannot be faster
than an uninformed search. Thus, we decided to keep both the IDDFS and the heuristic
approach in our system and run them in parallel on each task if sufficient CPUs are available.

The genetic algorithm (population size 1024, tournament selection) was able to resolve
all composition requests correctly for all knowledge bases and all registry sizes. It was able
to build good solutions regardless of how many services had to be involved in a valid solution
(solution depth). In spite of this correctness, it was always an order of magnitude slower than
the greedy search which provided the same level of correctness.

If the compositions become more complicated or involve quality of service (QoS)
aspects, it is not clear whether these can be resolved with a simple heuristic. Then, the genetic
algorithm could outperform greedy search approaches.

Architectural Considerations 

In 2007, we introduced a more refined version [226] of our 2006 semantic composition system 
[225]. The architecture of this composer, as illustrated in Figure 22.10, is designed in a 
very general way, making it not only a challenge contribution but also part of the ADDO 
web service brokering system [222, 223, 224]: In order to provide the functionality of the 
composition algorithms to other software components, it was made accessible as a Web 
Service shortly after WSC'06. The web service composer is available for any system where
semantic service discovery with the Web Ontology Language for Services (OWL-S) [71]
or similar languages is used. Hence, this contest application is indeed also a real-world
application.

Figure 22.10: The WSC 2007 Composition System of Bleul et al. [225, 226].



An application accesses the composition system by submitting a service request (illustrated
by (b)) through its Web Service Interface. It furthermore provides the service
descriptions and their semantic annotations. For this purpose, WSDL and XSD formatted files as
used in the WSC challenge or OWL-S descriptions have to be passed in ((a1) and (a2)).
These documents are parsed by a fast SAX-based Input Parser (c). The composition process
itself is started by the Strategy Planner (d). The Strategy Planner chooses an appropriate
composition algorithm and instructs it with the composition challenge document (e).

The software modules containing the basic algorithms all have direct access to the Knowledge
Base and to the Service Register. Although every algorithm and composition strategy
is unique, they all work on the same data structures. One or more composition algorithm
modules solve the composition requests and pass the solution to a SAX-based Output Writer (f),
an XML document generating module that is faster than DOM serialization. Here it is also
possible to transform it to, for example, BPEL4WS [989] descriptions. The result is afterwards
returned through the Web Service Interface (g).

One of the most important implementation details is the realization of the operation
"promising" since it is used by all composition algorithms in each iteration step. Therefore,
we internally merge the knowledge base and the service registry in a transparent way. This
step is described here because it is crucial for the overall system performance.

A semantic concept is represented by an instance of the class Concept. Each instance
of Concept holds a list of services that directly produce a parameter annotated with it
as output. The method getPromisingServices(A) of Concept, illustrated in Figure 22.11,
additionally returns all the Services that provide a specialization of the concept A as output.
In order to determine this set, all the specializations of the concept have to be traversed
and their promising services have to be accumulated. The crux of the routine is that
this costly traversal is only performed once per concept. Our experiments substantiated
that memory, even for the largest service repositories, is not a bottleneck. Hence,
getPromisingServices caches its results.

Figure 22.11: The Knowledge Base and Service Registry of our Composition System.

This caching is done in a way that is thread-safe on one hand and does not need
any synchronization on the other. Each instance X of Concept holds an internal variable
promisingServices which is initially null. If X.getPromisingServices() is invoked, it first
checks whether X.promisingServices is null. If so, the list of promising services is computed,
stored in X.promisingServices, and returned. Otherwise, X.promisingServices is returned
directly. Since we do not synchronize this method, it may be possible that the list is computed
concurrently multiple times. Each of these computations will produce the same result.
Although parallel invocations of X.getPromisingServices() may return different list instances,
their content is the same. The result of the computation finishing last will remain in
X.promisingServices whereas the other lists will get lost and eventually be freed by the
garbage collector. Further calls to X.getPromisingServices() will always yield the same,
last stored, result. This way, we can perform caching, which is very important for the
performance, and spare costly synchronization while still granting a maximum degree of
parallelization.
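
The caching idea can be sketched in Java as follows. The fields specializations and directProducers, the use of service names instead of service objects, and the unmodifiable result list are assumptions of this sketch. The volatile modifier is also an addition that the text does not mention: it lets the finished list be published safely under the Java memory model without any locking.

```java
import java.util.*;

final class Concept {
    final String name;
    // Both lists are assumed to be filled completely before the first query.
    final List<Concept> specializations = new ArrayList<>(); // direct children in the taxonomy
    final List<String> directProducers = new ArrayList<>();  // services producing exactly this concept
    private volatile List<String> promisingServices;         // null until first computed

    Concept(String name) { this.name = name; }

    /** Returns all services producing this concept or one of its specializations. */
    List<String> getPromisingServices() {
        List<String> cached = promisingServices;   // read the field only once
        if (cached == null) {
            Set<String> found = new LinkedHashSet<>();
            collect(this, found);
            cached = Collections.unmodifiableList(new ArrayList<String>(found));
            // Threads racing here store lists with equal content; the last write wins.
            promisingServices = cached;
        }
        return cached;
    }

    private static void collect(Concept c, Set<String> found) {
        found.addAll(c.directProducers);
        for (Concept child : c.specializations) collect(child, found);
    }
}
```

Because each thread only ever returns a list it has either fully built itself or read from the field, no caller can observe a partially filled list.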



Conclusions 

In order to solve the 2006 and 2007 Web Service Challenges, we utilized three different
approaches: an uninformed search, an informed search, and a genetic algorithm. The uninformed
search proved generally infeasible for large service repositories. It can only provide
a good performance if the resulting compositions are very short.

However, in the domain of web service composition, the maximum number of services in 
a composition is only limited by the number of services in the repositories and cannot be 
approximated by any heuristic. Therefore, any heuristic or metaheuristic search cannot be 
better than the uninformed search in the case that a request is sent to the composer which 
cannot be satisfied. This is one reason why the uninformed approach was kept in our system, 
along with its reliability for short compositions. 

Superior performance for all test sets could be obtained by utilizing problem-specific
information encapsulated in a fine-tuned heuristic function to guide a greedy search. This
approach is more efficient than the other two tested variants by an order of magnitude.

Genetic algorithms are much slower, but were also always able to provide correct results
to all requests. To put it simply, the problem of semantic composition as defined in the
context of the WSC is not complicated enough to fully unleash the potential of evolutionary
algorithms. They cannot compete with the highly efficient heuristic used in the greedy search.
We anticipate, however, that additional requirements will be imposed on a service composition
engine, especially in practical applications. Such requirements could include quality of
service (QoS), the question of optimal parallelization, or the generation of complete BPEL
[1071] processes. In this case, heuristic search will most probably become insufficient, but
genetic algorithms and Genetic Programming [1196] will still be able to deliver good results.



23 



Real-World Applications



In this chapter, we will explore real-world applications of global optimization techniques. 
Such applications are well-researched and established to a point where people are willing to 
bet money on them. They can safely be utilized in a production system. Some of these areas
where global optimization algorithms are applied in a practical fashion, aiding scientists and 
engineers with their work, are summarized in this chapter. 

23.1 Symbolic Regression 

In statistics, regression analysis examines the unknown relation
$\varphi : \mathbb{R}^m \mapsto \mathbb{R}$ of a dependent variable $y \in \mathbb{R}$ to
specified independent variables $\mathbf{x} \in \mathbb{R}^m$. Since $\varphi$ is not known,
the goal is to find a reasonably good approximation $\psi^\star$.

Definition 23.1 (Regression). Regression¹ [1150, 739, 631, 595] is a statistical technique
used to predict the value of a variable which is dependent on one or more independent
variables.

The result of the regression process is a function
$\psi^\star : \mathbb{R}^m \mapsto \mathbb{R}$ that relates the $m$ independent variables
(subsumed in the vector $\mathbf{x}$) to one dependent variable
$y \approx \psi^\star(\mathbf{x})$. The function $\psi^\star$ is the best estimator chosen
from a set of candidate functions $\psi : \mathbb{R}^m \mapsto \mathbb{R}$.
Regression is strongly related to the estimation theory outlined in Section 28.7 on page 499.
In most cases, like linear² or nonlinear³ regression, the mathematical model of the candidate
functions is not completely free. Instead, we pick a specific one from an array of parametric
functions by finding the best values for the parameters.

Definition 23.2 (Symbolic Regression). Symbolic regression [1196, 87, 2270, 1791, 1792, 
606, 607, 1112, 1699] is one of the most general approaches to regression. It is not limited 
to determining the optimal values for the set of parameters of a certain array of functions. 
Instead, regression functions can be constructed by combining elements of a set of mathe- 
matical expressions, variables and constants. 

23.1.1 Genetic Programming: Genome for Symbolic Regression 

One of the most widespread methods to perform symbolic regression is to apply Genetic 
Programming. Here, the candidate functions are constructed and refined by an evolutionary 
process. In the following we will discuss the genotypes (which are also the phenotypes) of the 
evolution as well as the objective functions that drive it. As illustrated in Figure 23.1, the 
solution candidates, i.e., the candidate functions, are represented by a tree of mathematical 
expressions where the leaf nodes are either constants or the fields of the independent variable 
vector x. 
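
Such an expression tree can be sketched in a few lines of Java. The type names Expr, Const,
Var, and BinaryOp are assumptions made for this illustration only and do not refer to an
existing library; the sketch covers only binary operations and omits the reproduction
operators.

```java
import java.util.function.DoubleBinaryOperator;

/** A node of an expression tree; evaluating it yields the function value for input x. */
interface Expr {
  double eval(double[] x);
}

/** Leaf node holding a constant value. */
final class Const implements Expr {
  private final double value;
  Const(double value) { this.value = value; }
  @Override public double eval(double[] x) { return value; }
}

/** Leaf node reading one element of the independent variable vector x. */
final class Var implements Expr {
  private final int index;
  Var(int index) { this.index = index; }
  @Override public double eval(double[] x) { return x[index]; }
}

/** Inner node applying a binary operation such as + or * to its two subtrees. */
final class BinaryOp implements Expr {
  private final DoubleBinaryOperator op;
  private final Expr left;
  private final Expr right;

  BinaryOp(DoubleBinaryOperator op, Expr left, Expr right) {
    this.op = op;
    this.left = left;
    this.right = right;
  }

  @Override public double eval(double[] x) {
    return op.applyAsDouble(left.eval(x), right.eval(x));
  }
}

final class ExprTreeDemo {
  /** Builds a fresh subtree for (x0 + 1). */
  static Expr xPlusOne() {
    return new BinaryOp((a, b) -> a + b, new Var(0), new Const(1.0));
  }

  /** Builds the tree for (x0 + 1) * (x0 + 1) from two independent subtrees. */
  static Expr squareOfXPlusOne() {
    return new BinaryOp((a, b) -> a * b, xPlusOne(), xPlusOne());
  }
}
```

Evaluating ExprTreeDemo.squareOfXPlusOne() for the one-element vector {2.0} computes
(2 + 1) * (2 + 1).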

1 http://en.wikipedia.org/wiki/Regression_analysis [accessed 2007-07-03] 

2 http://en.wikipedia.org/wiki/Linear_regression [accessed 2007-07-03] 

3 http://en.wikipedia.org/wiki/Nonlinear_regression [accessed 2007-07-03] 
Figure 23.1: An example genotype of symbolic regression with $\mathbf{x} = x \in \mathbb{R}^1$.



The set $\Psi$ of functions $\psi$ that can possibly be evolved is limited by the set of
expressions $\Sigma$ available to the evolutionary process.

$$\Sigma = \{+, -, *, /, \exp, \ln, \sin, \cos, \max, \min, \ldots\} \tag{23.1}$$

Another aspect that influences the possible results of the symbolic regression is the
concept of constants. In general, constants are not really needed since they can be
constructed indirectly via expressions. The constant 2.5, for example, equals the expression
$\frac{x}{x+x} + \frac{\ln(x * x)}{\ln x}$. The evolution of such artificial constants,
however, takes rather long. Koza [1196] has therefore introduced the concept of ephemeral
random constants.

Definition 23.3 (Ephemeral Random Constants). If a new individual is created and a
leaf in its expression tree is chosen to be an ephemeral random constant, a random number
is drawn uniformly distributed from a reasonable interval. For each new constant leaf, a new
constant is created independently. The values of the constant leaves remain unchanged and
can be moved around and copied by crossover operations.

According to Koza's idea, ephemeral random constants remain unchanged during the
evolutionary process. In our work, it has proven practicable to extend his approach by
providing a mutation operation that changes the value $c$ of a constant leaf of an individual.
A good policy for doing so is to replace the old constant value $c_{old}$ by a new one
$c_{new}$ which is a normally distributed random number with the expected value $c_{old}$
(see Definition 28.70 on page 528):

$$c_{new} = \mathrm{random}_n\left(c_{old}, \sigma^2\right) \tag{23.2}$$
$$\sigma^2 = e^{-\mathrm{random}_u(0, 10)} * |c_{old}| \tag{23.3}$$

Notice that the other reproduction operators for tree genomes have been discussed in 
detail in Section 4.3 on page 162. 
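
The mutation policy of Equations 23.2 and 23.3 could be implemented roughly as in the
following sketch. It assumes that the exponent in Equation 23.3 is drawn uniformly from
[0, 10) and uses java.util.Random as the random source; the class and method names are
hypothetical.

```java
import java.util.Random;

/** Sketch of a mutation operator for constant leaves (names are assumptions). */
final class ConstantMutation {
  private ConstantMutation() { }

  /** Returns a new constant that is normally distributed around cOld. */
  static double mutate(double cOld, Random rng) {
    double u = 10.0 * rng.nextDouble();               // uniform exponent in [0, 10)
    double variance = Math.exp(-u) * Math.abs(cOld);  // Equation 23.3
    double sigma = Math.sqrt(variance);               // standard deviation
    return cOld + sigma * rng.nextGaussian();         // Equation 23.2
  }
}
```

Since the variance scales with the absolute value of the old constant, a constant of exactly
zero is never changed by this operator. An implementation may want to treat that case
separately.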



23.1.2 Sample Data, Quality, and Estimation Theory 

In the following elaborations, we will reuse some terms that we have applied in our discussion 
on likelihood in Section 28.7.2 on page 500 in order to find out what measures will make 
good objective functions for symbolic regression problems. 

Again, we are given a finite set of sample data $A$ containing $n = |A|$ pairs
$(\mathbf{x}_i, y_i)$ where the vectors $\mathbf{x}_i \in \mathbb{R}^m$ are known inputs to an
unknown function $\varphi : \mathbb{R}^m \mapsto \mathbb{R}$ and the scalars $y_i$ are its
observed outputs (possibly contaminated with noise and measurement errors subsumed in the
term $\eta_i$, see Equation 28.237 on page 500). Furthermore, we can access a (possibly
infinitely large) set $\Psi$ of functions $\psi : \mathbb{R}^m \mapsto \mathbb{R}$,
$\psi \in \Psi$, which are possible estimators of $\varphi$. For the inputs $\mathbf{x}_i$,
the results of these functions $\psi$ deviate by the estimation error $\varepsilon$ (see
Definition 28.53 on page 499) from the $y_i$.

$$y_i = \varphi(\mathbf{x}_i) + \eta_i \quad \forall i \in [0..n-1] \tag{23.4}$$
$$y_i = \psi(\mathbf{x}_i) + \varepsilon_i(\psi) \quad \forall \psi \in \Psi, i \in [0..n-1] \tag{23.5}$$

In order to guide the evolution of estimators (in other words, to drive the regression
process), we need an objective function that favors solution candidates which represent the
sample data $A$, and thus resemble the function $\varphi$, closely. Let us call this
"driving force" the quality function.

Definition 23.4 (Quality Function). The quality function $f(\psi, A)$ defines the quality of
the approximation of a function $\varphi$ by a function $\psi$. The smaller the value of the
quality function, the more precise is the approximation of $\varphi$ by $\psi$ in the context
of the sample data $A$.

Under the conditions that the measurement errors $\eta_i$ are uncorrelated and are all
normally distributed with an expected value of zero and the same variance (see Equation 28.238,
Equation 28.239, and Equation 28.240 on page 500), we have shown in Section 28.7.2
that the best estimators minimize the mean square error MSE (see Equation 28.253 on
page 502, Definition 28.60 on page 503 and Definition 28.56 on page 499). Thus, if the
source of the values $y_i$ complies at least in a simplified, theoretical manner with these
conditions or even is a real measurement process, the square error is the quality function to
choose.

$$f_{\sigma \neq 0}(\psi, A) = \sum_{i=0}^{\mathrm{len}(A)-1} \left(y_i - \psi(\mathbf{x}_i)\right)^2 \tag{23.6}$$

While this is normally true, there is one exception to the rule: the case where the
values $y_i$ are not measurements but direct results from $\varphi$ and $\eta = 0$. A common
example for this situation is if we apply symbolic regression in order to discover functional
identities [1785, 1527, 1196] (see also Section 23.1.3). Different from normal regression
analysis or estimation, we then know $\varphi$ exactly and want to find another function
$\psi^\star$ that is another, equivalent form of $\varphi$. Therefore, we will use $\varphi$
to create the sample data set $A$ beforehand, carefully selecting characteristic points
$\mathbf{x}_i$. Thus, the noise and the measurement errors $\eta_i$ all become zero. If we
would still regard them as normally distributed, their variance $\sigma^2$ would be zero, too.

The proof for the statement that minimizing the square errors maximizes the likelihood
is based on the transition from Equation 28.248 to Equation 28.249 on page 502 where we
cut divisions by $\sigma^2$. This is not possible if $\sigma$ becomes zero. Hence, we may or
may not select metrics different from the square error as quality function. Its feature of
punishing larger deviations more strongly than small ones, however, is attractive even if the
measurement errors become zero. Another metric which can be used as quality function in
these circumstances is the sum of the absolute values of the estimation errors:

$$f_{\sigma = 0}(\psi, A) = \sum_{i=0}^{\mathrm{len}(A)-1} \left|y_i - \psi(\mathbf{x}_i)\right| \tag{23.7}$$
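
Both quality functions are straightforward to compute. The following sketch implements
Equations 23.6 and 23.7 for the special case of scalar inputs ($m = 1$), which is an
assumption made to keep the example short; the class and method names are hypothetical.

```java
import java.util.function.DoubleUnaryOperator;

/** Sketch of the two quality functions for scalar inputs (m = 1). */
final class Quality {
  private Quality() { }

  /** Sum of the squared estimation errors, see Equation 23.6. */
  static double squareError(DoubleUnaryOperator psi, double[] xs, double[] ys) {
    double sum = 0.0;
    for (int i = 0; i < xs.length; i++) {
      double error = ys[i] - psi.applyAsDouble(xs[i]);
      sum += error * error;
    }
    return sum;
  }

  /** Sum of the absolute estimation errors, see Equation 23.7. */
  static double absoluteError(DoubleUnaryOperator psi, double[] xs, double[] ys) {
    double sum = 0.0;
    for (int i = 0; i < xs.length; i++) {
      sum += Math.abs(ys[i] - psi.applyAsDouble(xs[i]));
    }
    return sum;
  }
}
```

For the candidate x -> x + 1 and the samples (0, 1), (1, 2), and (2, 5), only the last point
deviates (by 2), so the square error is 4 and the absolute error is 2. The square error thus
weights this single large deviation more heavily.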



23.1.3 An Example and the Phenomenon of Over-fitting 

If multi-objective optimization can be applied, the quality function should be complemented
by an objective function that puts pressure in the direction of smaller estimators $\psi$.
In symbolic regression by Genetic Programming, the problem of code bloat (discussed
in Section 4.10.3 on page 224) is prominent. Here, functions do not only grow large because
they include useless expressions (like x * x + x - x - 1). A large function may consist of
functional expressions only, but instead of really representing or approximating $\varphi$,
it has degenerated into just some sort of misfit decision table. This phenomenon is called
overfitting and has initially been discussed in Section 1.4.8 on page 72.

Let us, for example, assume we want to find a function similar to Equation 23.8. Of
course, we would hope to find something like Equation 23.9.

$$y = \varphi(x) = x^2 + 2x + 1 \tag{23.8}$$
$$y = \psi_1^\star(x) = (x + 1)^2 = (x + 1)(x + 1) \tag{23.9}$$

Table 23.1: Sample data $A = \{(x_i, y_i) : i \in [0..8]\}$ for Equation 23.8

For testing purposes, we randomly choose the nine sample data points listed in Table 23.1.
As a result of Genetic Programming based symbolic regression, we may obtain something like
Equation 23.10, outlined in Figure 23.2, which represents the data points quite precisely but
has nothing to do with the original form of our equation.

$\psi_2^\star(x) =$ (((((0.934911896352446 * 0.258746335682841) - (x * ((x / ((x -
0.763517999368926) + (0.0452368900127981 - 0.947318140392111))) / ((x - (x + x)) +
(0.331546588012695 * (x + x)))))) + 0.763517999368926) + ((x - (((0.934911896352446 *
((0.934911896352446 / x) / (x + 0.947390132934724))) + (((x * 0.235903629190878) * (x -
0.331546588012695)) + ((x * x) + x))) / x)) * ((((x - (x * (0.258746335682841 /
0.455160839551232))) / (0.0452368900127981 - 0.763517999368926)) * x) *
(0.763517999368926 * 0.947318140392111)))) - (((((x - (x * (0.258746335682841 /
0.455160839551232))) / (0.0452368900127981 - 0.763517999368926)) * 0.763517999368926)
* x) + (x - (x * (0.258746335682841 * 0.934911896352446))))) (23.10)

We obtained both functions $\psi_1^\star$ (in its second form) and $\psi_2^\star$ using the
symbolic regression applet of Hannes Planatscher which can be found at
http://www.potschi.de/sr/ [accessed 2007-07-03]⁴. It needs to be said that the first (wanted)
result occurred far more often than absurd variations like $\psi_2^\star$. But indeed, there
are some factors which further the evolution of such eyesores:

1. If only a few sample data points are provided, the set of prospective functions that have
a low estimation error becomes larger. Therefore, chances are that symbolic regression
provides results that only match those points but differ significantly from $\varphi$ at all
other points.

2. If the sample data points are not chosen wisely, their expressiveness is low. We, for
instance, chose 4.9, 5, and 5.1 as well as 2.9, 3, and 3.1, which form two groups with
members very close to each other. Therefore, a curve that approximately hits these two
clouds is automatically rated with a high quality.



4 Another good applet for symbolic regression can be found at
http://alphard.ethz.ch/gerber/approx/default.html [accessed 2007-07-03]

Figure 23.2: $\varphi(x)$, the evolved $\psi_1^\star(x) \equiv \varphi(x)$, and $\psi_2^\star(x)$.



3. A small population size decreases the diversity and furthers "incest" between similar
solution candidates. Due to a lower rate of exploration, often only a local minimum of the
quality value will be found.

4. Allowing functions of large depth and putting low pressure against bloat
(see Section 4.10.3 on page 224) leads to uncontrolled function growth. The real laws
$\varphi$ that we want to approximate with symbolic regression usually do not consist of more
than 40 expressions. This is valid for most physical, mathematical, or financial equations.
Therefore, the evolution of large functions is counterproductive in those cases.

Although we made some of these mistakes intentionally, there are many situations where
it is hard to determine good parameter sets and restrictions for the evolution, and such
mistakes occur accidentally.
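
The first two factors can be illustrated with a small constructed example. The sketch below
uses only the six clustered points mentioned in item 2 as sample inputs (an assumption, since
the remaining entries of Table 23.1 are not used here) and a hand-made candidate psiBad which
is not a result of Genetic Programming. It is built by adding a polynomial to the target
function that vanishes exactly at the sample points. All names are hypothetical.

```java
import java.util.function.DoubleUnaryOperator;

/** Constructed example: zero error on the samples, large deviation elsewhere. */
final class OverfitDemo {
  private OverfitDemo() { }

  /** The six clustered sample inputs mentioned in the text. */
  static final double[] XS = {2.9, 3.0, 3.1, 4.9, 5.0, 5.1};

  /** The target function of Equation 23.8. */
  static double phi(double x) {
    return x * x + 2.0 * x + 1.0;
  }

  /** Hand-made candidate: phi plus a product that is zero at every sample input. */
  static double psiBad(double x) {
    double bump = 1.0;
    for (double s : XS) {
      bump *= (x - s);
    }
    return phi(x) + bump;
  }

  /** Square error of a candidate on samples generated from phi without noise. */
  static double squareError(DoubleUnaryOperator psi) {
    double sum = 0.0;
    for (double x : XS) {
      double error = phi(x) - psi.applyAsDouble(x);
      sum += error * error;
    }
    return sum;
  }
}
```

Here, squareError(OverfitDemo::psiBad) is zero, so the quality function cannot distinguish
psiBad from the wanted result (x + 1) * (x + 1). Away from the two clusters, however, the
functions differ drastically: at x = 0, psiBad exceeds the target by the product of the six
sample inputs, which is more than 3000.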

23.1.4 Limits of Symbolic Regression 

Often, we cannot obtain an optimal approximation of $\varphi$, especially if $\varphi$ cannot
be represented by the basic expressions available to the regression process. One of these
situations has already been discussed before: the case where $\varphi$ has no closed
arithmetical expression. Another possibility is that the regression method tries to generate
a polynomial that approximates $\varphi$, but $\varphi$ contains different expressions like
$\sin$ or $e^x$ or polynomials of an order higher than available. Yet another problem is that
the values $y_i$ are often not results computed by $\varphi$ directly but could, for example,
be measurements taken from some physical entity, and we want to use regression to determine
the interrelations between this entity and some known parameters. Then, the measurements
will be biased by noise and systematic measurement errors. In this situation,
$f(\psi^\star, A)$ will be greater than zero even after a successful regression.

23.2 Global Optimization of Distributed Systems 
23.2.1 Introduction 

Optimization algorithms are methods for finding optimal configurations of different features
of their solution candidates. Many aspects of distributed systems are configurable or depend
on parameter settings, such as the topology, security, and routing. Hence, there is a huge
potential for using global optimization algorithms in order to improve them.

And indeed, this potential is widely utilized. The study by Sinclair [1886] from 1999 re- 
ported that more than 120 papers had been published on work which employed Evolutionary 
Computation for optimizing network topologies and dimension, node placement, routing, and 
wavelength or frequency allocation. The comprehensive master's thesis by Kampstra from 
2005 [1087, 1088] builds on this aforementioned study and classifies over 400 papers. Ac- 
cording to Kampstra, communication networks was the field with the most researchers listed 
in EvoWeb, the European Network of Excellence in Evolutionary Computing, in 2005. The 
first workshop on this topic, Evolutionary Telecommunications [1889], took place in 1999. 
In the year 2000 alone, two books ([450] and [1630]) were published on the application
of Evolutionary Computation to networking. Further summary papers appeared around the
same time [1851, 1629, 2033, 2109]. The recent studies from Alba and Chicano [31] and 
Cortes Achedad et al. [453] as well as the high number of papers published every year show 
that the interest in applying global optimization techniques in this problem domain has by 
no means decreased. 

Most of the mentioned summaries concentrate on giving an overview in the form of a more
prosaic version of paper listings. We [2186] provide such a listing in condensed form in
Section 23.2.3, but focus on giving clear and detailed in-depth discussions of multiple example
applications and also introduce the optimization algorithms utilized in them. This way, the
subject becomes more tangible for an audience which is rooted in only one of the two involved
subject areas.

We studied more than 130 papers from two decades of research in evolutionary
telecommunication. Figure 23.3 illustrates how these papers are distributed over the period
from 1987 to 2008. The papers are classified according to the area of application, their
optimization goals, problem representations, and the optimization algorithms utilized.
Figure 23.4 gives an overview of which areas were tackled by the researchers and which
optimization algorithms were applied in the papers we studied. Here, it is important to notice
that a paper may deal with multiple applications at once (like routing algorithms which also
perform load balancing) and thus may appear in multiple columns. The complete subject catalog
resulting from our survey can be found in Section 23.2.3. Such a list, however, gives only a
limited idea about the actual approaches that have been developed. Therefore, we will use the
following sections to first take a deeper look into some interesting optimization approaches
from various areas of distributed systems which exemplify the variety and the potential of
this field of research. Different methods to synthesize or to improve network topologies are
outlined in ??, adaptive or evolved routing protocols will be discussed in ??, and different
approaches to the generation of protocols with global optimization algorithms are summarized
in Section 23.2.2. In ??, we illustrate some security aspects and how they were optimized by
different research groups before ending our overview on applications with software
configuration and parameter adaption approaches in ??. After a representative list of
publications from all these research areas (Section 23.2.3), we conclude this summary in
Section 23.2.4.

Figure 23.3: The number of papers studied for this survey per year.

Figure 23.4: The number of papers analyzed, broken down to application area and optimization
method.
23.2.2 Synthesizing Protocols

Protocols like IP [1013] and TCP [362] are the rules for message and information exchange 
in a network. Depending on the application, protocols can become arbitrarily complex and 
strongly influence the efficiency and robustness of a distributed system. 
Evolving Fraglet-based Protocols

In Section 4.9.2, we have outlined the Fraglet language. This form of protocol representation 
is predestined for automated synthesis via evolutionary algorithms: Fraglets have almost no 
syntactical constraints and can represent complicated protocols in a compact manner. Sim- 
ilar to us, Tschudin investigated the offline evolution of protocols using a genetic algorithm. 

In his work, a complete communication system was simulated for a given number of 
time steps during the evaluation of each Fraglet program. The objective values denote the 
correlation of the behavior observed during the simulation and the target behavior. Tschudin 
concluded that evolutionary methods are suitable to optimize existing Fraglet protocols, 
but also indicated that the evolution of new distributed algorithms is difficult because of
a strong tendency to overfitting (see Section 1.4.8) and the all-or-nothing problem known
from Genetic Programming (see Section 4.10.2 on page 223).

Online Protocol Adaptation and Evolution with Fraglets 

Autonomic networks are networks where manual management is not desired or hard to 
realize, such as systems of hundreds of gadgets in an e-home, sensor networks, or arbitrary 
mesh networks with wireless and wired links. Yamamoto and Tschudin [2275] pointed out 
that software in such networks should be self-modifying so as to be able to react to unforeseen 
network situations. They distinguish two forms of such reactions - adaption and evolution. 
Adaption is the short-term reconfiguration of existing modules whereas evolution stands 
for the modification of old and the discovery of new functionality and h 
timescale. Software with these abilities probably cannot predict whether the effects of a 
modification are positive or negative in advance and therefore, needs to be resilient in order 
to mitigate faulty code that could evolve. In [2059], Tschudin and Yamamoto showed that 
such resilience can be achieved to a certain degree by introducing redundancy into Fraglet 
protocols. 

Complementing Tschudin's work on offline protocol synthesis and optimization [2058], 
Yamamoto and Tschudin [2275] describe online protocol evolution as a continuously ongo- 
ing, decentralized, and asynchronous process of constant competition and selection of the 
most feasible modules. Genetic Programming with mutation and homologous crossover is 
chosen for accomplishing these features. The fitness measure (subject to maximization) is 
the performance of the protocols as perceived by the applications running in the network. 
The score of a solution candidate (i.e., a protocol) is incremented if it behaves correctly
and decremented whenever an error is detected. The resource consumption in terms of the 
memory allocated by the protocols is penalized proportionally. 
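
The bookkeeping behind such a fitness measure can be sketched as follows. This is not the
implementation used by Yamamoto and Tschudin; the class name, the method names, and the three
weights are assumptions chosen purely for illustration.

```java
/** Sketch of a running score for one protocol; all names and weights are assumptions. */
final class ProtocolScore {
  // Hypothetical weights; suitable values depend on the application.
  static final double REWARD = 1.0;
  static final double PENALTY = 1.0;
  static final double MEMORY_WEIGHT = 0.25;

  private double score;

  /** Called whenever the application observes correct protocol behavior. */
  void onCorrectBehavior() {
    score += REWARD;
  }

  /** Called whenever an error is detected. */
  void onError() {
    score -= PENALTY;
  }

  /** Penalizes resource consumption proportionally to the allocated memory. */
  void chargeMemory(long allocatedUnits) {
    score -= MEMORY_WEIGHT * allocatedUnits;
  }

  /** The fitness value, subject to maximization. */
  double fitness() {
    return score;
  }
}
```

With these weights, a protocol that behaves correctly three times, causes one error, and
allocates four memory units ends up with a fitness of 3 - 1 - 1 = 1.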

Yamamoto and Tschudin [2273, 2274] created populations containing a mix of different
confirmed delivery and reliable delivery protocols for messages. These populations were
then confronted with either reliable or unreliable transmission challenges and were able to
adapt to these conditions quickly. If the environment changed afterwards, for example when a
formerly reliable channel became unreliable, the degree of re-adaptation was, however,
unsatisfactory. The loss of diversity due to the selection of only highly fit protocols during
the adaptation phase could not yet be compensated by mutation in these first experiments.

Further information on approaches for evolutionary online optimization of communica- 
tion protocols can be found in the report Framework for Distributed On-line Evolution of 
Protocols and Services, 2nd Edition from the EU-sponsored project BIONETS [1429]. 

23.2.3 Paper List 

In this section, we list the papers concerning the optimization of distributed systems. This
concise list groups the papers according to the area of application, the optimization goals,
the problem representations, and the optimization algorithms utilized. This collection lists
a wide variety of approaches developed by a large number of authors (nearly 200 authors
are involved in the papers listed). In our opinion, this heterogeneity and distribution should
be interpreted as a clear indicator that the optimization of distributed systems lends itself
to heuristic and metaheuristic approaches. Many papers provide engineering-level solutions
which often deliver excellent results.

Topology Optimization and Terminal Assignment 

General Networks or Theory 

1. Abuali et al. [10, 11] (1994) aims: synthesis, costs; representation: integer strings; opti- 
mization methods: evolutionary algorithms and local search, see also ?? 

2. Khuri and Chiu [1133] (1997) aims: synthesis, costs; representations: bit strings and 
integer strings; optimization methods: evolutionary algorithms and local search, see also 
?? 

3. Salcedo-Sanz and Yao [1788] (2004) aims: synthesis, costs; representations: bit strings
and integer strings; optimization method: evolutionary algorithms 

4. Lehmann and Kaufmann [1270, 2335] (2005-2007) aims: synthesis, self-organization, 
QoS features, dynamic or adaptive behavior; representation: information distributed 
over the network; optimization method: evolutionary algorithms, see also ?? 

Computer Networks in General 

5. Michalewicz [1405] (1991) aims: synthesis, robustness; optimization method: evolution- 
ary algorithms, see also ?? 

6. Kumar et al. [1220] (1993) aims: synthesis, robustness, QoS features; representation: bit 
strings; optimization method: evolutionary algorithms, see also ?? 

7. Pierre and Legault [1644, 1645] (1996-1998) aims: synthesis, QoS features, costs; repre- 
sentation: bit strings; optimization method: evolutionary algorithms 

8. Ko et al. [1164] (1997) aims: synthesis, QoS features, costs; representation: bit strings; 
optimization methods: evolutionary algorithms and local search 

9. Dengiz et al. [555] (1997) aims: synthesis, robustness, costs; representation: integer
strings; optimization methods: evolutionary algorithms and Memetic Algorithms 

10. Montana et al. [1445] (2002-2004) aims: QoS features, dynamic or adaptive behavior; 
representation: integer strings; optimization method: evolutionary algorithms 

11. Yao et al. [2286] (2005) aims: synthesis, costs; representation: trees; optimization meth- 
ods: evolutionary algorithms and local search, see also ?? 

Wireless or Mobile Computer Networks 

12. Lai et al. [1231, 1232] (2007) aims: synthesis, robustness; representation: integer strings; 
optimization method: evolutionary algorithms 

Telecommunication Networks in General 

13. Dengiz et al., see entry 9. 

14. Pierre and Elgibaoui [1641] (1997) aims: synthesis, robustness, QoS features, costs; op- 
timization method: Tabu Search 

Wireless or Mobile Telecommunication Networks 

15. Pierre and Houeto [1643, 1642] (2002) aims: synthesis, costs; representation: bit strings; 
optimization method: Tabu Search 

16. Quintero and Pierre [1685, 1683, 1684] (2002-2003) aim: costs; representation: inte- 
ger strings; optimization methods: evolutionary algorithms, local search, Memetic Algo- 
rithms, Tabu Search, and Simulated Annealing 
17. St-Hilaire et al. [1953, 1952] (2006) aims: synthesis, costs; optimization methods: local
search and Tabu Search 

18. Salcedo-Sanz et al. [1789, 1790] (2008) aims: synthesis, QoS features, costs; representa- 
tion: bit strings; optimization methods: evolutionary algorithms and Memetic Algorithms 

Optical Networks in General 

19. Sinclair [1885, 23, 1887, 1888] (1995-2000) aims: synthesis, robustness, costs; represen- 
tations: bit strings and trees plus genotype-phenotype mappings; optimization method: 
evolutionary algorithms, see also ?? 

20. Brittain et al. [290] (1997) aims: synthesis, costs; representations: bit strings and integer 
strings; optimization methods: evolutionary algorithms and local search 

Node Placement 

21. Alba et al. [35] (2002) aims: synthesis, robustness, costs; representation: bit strings; 
optimization method: evolutionary algorithms 

22. Salcedo-Sanz et al., see entry 18. 

Dimensioning and Capacity Assignment 

Computer Networks in General 

23. Coombs and Davis [441] (1987) aim: QoS features; optimization method: evolutionary 
algorithms, see also ?? 

24. Ko et al., see entry 8. 

25. Martin et al. [1363] (2008) aim: synthesis; representation: integer strings; optimization 
methods: evolutionary algorithms, Extremal Optimization, and Particle Swarm Opti- 
mization 

Telecommunication Networks in General 

26. Martin et al., see entry 25. 

Frequency and Channel Assignment 

27. Tan and Sinclair [2004] (1995) aims: synthesis, costs; representation: bit strings; opti- 
mization methods: evolutionary algorithms and local search 

Protocol Generation and Optimization 

General Networks or Theory 

28. Mackin and Tazaki [1340, 1341, 1342] (1999-2002) aim: synthesis; representation: trees; 
optimization method: evolutionary algorithms, see also ?? 

Computer Networks in General 

29. El-Fakihy et al. [2272, 628] (1995-1999) aims: synthesis, QoS features; representation: 
bit strings; optimization methods: evolutionary algorithms and Memetic Algorithms, see 
also ?? 

30. Sharples and Wakeman [1860, 1862, 1861] (1999-2001) aims: synthesis, robustness, QoS
features; representation: bit strings plus genotype-phenotype mappings; optimization 
method: evolutionary algorithms, see also ?? 
31. Song et al. [1637, 1918] (2000-2001) aims: synthesis, QoS features; representation: trees;
optimization method: local search, see also ?? 

32. Grace [847] (2000) aims: synthesis, robustness, QoS features; representation: trees; op- 
timization method: evolutionary algorithms, see also ?? 

33. Ye and Kalyanaraman [2290, 2289] (2001-2003) aims: QoS features, dynamic or adaptive
behavior; representation: real vectors 

34. Van Belle et al. [2089, 2090] (2001-2003) aims: synthesis, robustness, QoS features; 
representation: bit strings; optimization method: evolutionary algorithms, see also ?? 

35. de Araújo et al. [504, 505] (2003) aim: synthesis; representation: integer strings;
optimization method: evolutionary algorithms, see also ??

36. Tschudin [2058] (2003) aims: synthesis, robustness; representation: linear programs; op- 
timization method: evolutionary algorithms, see also Section 23.2.2 

37. Yamamoto and Tschudin [2274, 2275, 2273] (2005) aims: synthesis, robustness, dynamic 
or adaptive behavior; representations: information distributed over the network and 
linear programs; optimization method: evolutionary algorithms, see also Section 23.2.2 

Wireless or Mobile Computer Networks 

38. Montana and Redi [1444] (2005) aim: QoS features; representation: real vectors; opti- 
mization method: evolutionary algorithms 

39. Weise et al. [2180, 2187] (2007-2008) aims: synthesis, dynamic or adaptive behavior; 
optimization method: evolutionary algorithms 

Routing 

General Networks or Theory 

40. Christensen et al. [401] (1997) aims: QoS features, costs; representation: integer strings;
optimization method: evolutionary algorithms 

41. Kirkwood et al. [1143] (1997) aims: synthesis, robustness; representation: trees; opti- 
mization method: evolutionary algorithms, see also ?? 

42. Zhu et al. [2324] (1998) aim: synthesis; representation: integer strings; optimization 
methods: evolutionary algorithms and local search 

Computer Networks in General 

43. Kirkwood et al., see entry 41. 

44. Munetomo et al. [1487, 1488, 1489, 1484] (1997-1999) aims: self-organization,
robustness, dynamic or adaptive behavior; representations: integer strings and information
distributed over the network; optimization method: evolutionary algorithms, see also ??

45. Ko et al., see entry 8. 

46. Di Caro and Dorigo [561, 560, 559] (1998-2004) aims: self-organization, robustness, 
dynamic or adaptive behavior; representation: information distributed over the network; 
optimization method: ACO/ant agents, see also ?? 

47. Bonabeau et al. [245] (1999) aim: dynamic or adaptive behavior; representation: infor- 
mation distributed over the network; optimization method: ACO/ant agents 

48. Fei et al. [647] (1999) aims: robustness, QoS features; representation: bit strings 

49. Liang et al. [1281, 1282] (2002-2006) aims: robustness, dynamic or adaptive behavior; 
representation: information distributed over the network; optimization methods: evolu- 
tionary algorithms and ACO/ant agents, see also ?? 

50. Sim and Sun [1880] (2002) representation: information distributed over the network; 
optimization method: ACO/ant agents 

51. Cox, Jr. et al. [461] (1991) aims: QoS features, costs, dynamic or adaptive behavior; 
representation: integer strings; optimization method: evolutionary algorithms 

52. Schoonderwoerd et al. [1832, 1833, 1834] (1996-1997) aims: synthesis, self-organization, 
robustness, QoS features, dynamic or adaptive behavior; representation: information 
distributed over the network; optimization method: ACO/ant agents, see also ?? 

53. Christensen et al., see entry 40. 

54. Zhu et al., see entry 42. 

55. Lukschandl et al. [1329, 1330, 1331] (1999-2000) aims: robustness, costs; representation: 
linear programs; optimization method: evolutionary algorithms, see also Section 4.6.5 

56. Galiasso and Wainwright [759] (2001) aims: synthesis, costs; representation: integer 
strings; optimization methods: evolutionary algorithms and Memetic Algorithms 

57. Sandalidis et al. [1798] (2001) aim: dynamic or adaptive behavior; representation: infor- 
mation distributed over the network; optimization method: ACO/ant agents 

Optical Networks in General 

58. Wang et al. [2155] (2004) aim: QoS features; optimization method: evolutionary algo- 
rithms 

Load Balancing and Call Admission 

Computer Networks in General 

59. Munetomo et al., see entry 44. 

60. Oates and Corne [1553] (2001) aim: QoS features; representation: integer strings; opti- 
mization methods: evolutionary algorithms, local search, and Simulated Annealing 

61. Zapf and Weise [2311, 2310] (2007) aims: synthesis, self-organization; representation: bit 
strings; optimization method: evolutionary algorithms 

Telecommunication Networks in General 

62. Schoonderwoerd et al., see entry 52. 

Peer-To-Peer Systems 

63. Iles and Deugo [1011] (2002) aims: robustness, dynamic or adaptive behavior; representation: 
trees; optimization method: evolutionary algorithms, see also ?? 

64. Forestiero et al. [724, 725, 728, 727, 726, 729] (2005-2008) aims: self-organization, QoS 
features, dynamic or adaptive behavior; representation: information distributed over the 
network; optimization method: ACO/ant agents 

Broadcast and Multicast 

General Networks or Theory 

65. Christensen et al., see entry 40. 

66. Zhu et al., see entry 42. 

67. Comellas and Giménez [434] (1998) aims: synthesis, QoS features; representation: trees; 
optimization method: evolutionary algorithms, see also ?? 

Computer Networks in General 

68. Fei et al., see entry 48. 

69. Grace, see entry 32. 

70. Christensen et al., see entry 40. 

71. Zhu et al., see entry 42. 

72. Galiasso and Wainwright, see entry 56. 

Other 

73. Jaros and Dvorak [1040] (2008) aims: synthesis, QoS features; representation: integer 
strings; optimization methods: Memetic Algorithms and estimation of distribution algo- 
rithms, see also ?? 

Security and Intrusion Detection 

Computer Networks in General 

74. Heady et al. [911] (1990) aim: synthesis; representation: bit strings; optimization method: 
evolutionary algorithms, see also ?? 

75. Song et al., see entry 31. 

76. Song et al. [1919, 1920] (2003) aim: synthesis; representation: linear programs; optimiza- 
tion method: evolutionary algorithms, see also ?? 

77. Liu et al. [1296] (2004) aim: synthesis; optimization method: evolutionary algorithms 

78. Mukkamala et al. [1482] (2004) aims: synthesis, robustness; representation: linear pro- 
grams; optimization method: evolutionary algorithms, see also ?? 

79. Lu and Traore [1316] (2004) aim: synthesis; representation: integer strings plus 
genotype-phenotype mappings; optimization method: evolutionary algorithms, see also ?? 

80. Folino et al. [710] (2005) aim: synthesis; representations: trees and linear programs; 
optimization method: evolutionary algorithms, see also ?? 

81. Hansen et al. [887] (2007) aim: synthesis; representation: linear programs; optimization 
method: evolutionary algorithms, see also ?? 

Wireless or Mobile Computer Networks 

82. LaRoche and Zincir-Heywood [1257] (2005) aim: synthesis; representation: linear pro- 
grams; optimization method: evolutionary algorithms, see also ?? 

Agent Cooperation (non-ant) 

General Networks or Theory 

83. Werner and Dyer [2194] (1992) aim: synthesis; representation: integer strings plus 
genotype-phenotype mappings; optimization method: artificial life, see also ?? 

84. Andre [54] (1995) aim: synthesis; representation: trees; optimization method: evolution- 
ary algorithms 

85. Qureshi [1687, 1686, 1688] (1996-2001) aim: synthesis; representation: trees; optimiza- 
tion method: evolutionary algorithms 

86. Iba et al. [984, 987, 985, 986] (1996-1999) aims: synthesis, robustness; representation: 
trees; optimization method: evolutionary algorithms, see also ?? 

87. Mackin and Tazaki, see entry 28. 

Computer Networks in General 

88. Nakano and Suda [1497, 1498, 1499] (2004-2007) aims: self-organization, QoS features, 
dynamic or adaptive behavior; representations: real vectors and information distributed 
over the network; optimization method: evolutionary algorithms, see also ?? 

89. Zapf and Weise, see entry 61. 

90. Schoonderwoerd et al., see entry 52. 

Software Configuration 

91. Grace, see entry 32. 

92. Iles and Deugo, see entry 63. 

93. Xi et al. [2268] (2004) aim: QoS features; representation: integer strings; optimization 
methods: local search and Simulated Annealing, see also ?? 

94. Nakano and Suda, see entry 88. 

Hardware Design and Configuration 

Networks in General 

95. Martin et al., see entry 25. 

Wireless or Mobile Networks in General 

96. Choo et al. [400] (2000) aim: synthesis; representation: bit strings; optimization method: 
evolutionary algorithms 

97. Lohn et al. [1307] (2004) aim: synthesis; representation: real vectors; optimization 
method: evolutionary algorithms 

98. Villegas et al. [2116] (2004) aim: synthesis; optimization method: evolutionary algorithms 

99. Koza et al. [1212] (2005) aim: synthesis; representation: trees; optimization method: 
evolutionary algorithms 

100. John and Ammann [1057, 1058] (2006) aim: synthesis; representation: bit strings; opti- 
mization method: evolutionary algorithms 

101. Chattoraj and Roy [378] (2006) aim: synthesis; representation: bit strings; optimization 
method: evolutionary algorithms 

Algorithm Synthesis 

Computer Networks in General 

102. Weise et al. [2179, 2184] (2007-2008) aim: synthesis; representation: integer strings; 
optimization methods: evolutionary algorithms and local search 

103. Weise et al. [2181] (2007) aim: synthesis; representation: bit strings; optimization 
method: evolutionary algorithms 

Wireless or Mobile Computer Networks 

104. Weise and Geihs [2176, 2175, 2177] (2001-2006) aims: synthesis, robustness; represen- 
tation: linear programs; optimization method: evolutionary algorithms 

105. Weise et al. [2182, 2183] (2007) aim: synthesis; representation: linear programs; optimization 
method: evolutionary algorithms 

23.2.4 Conclusions 

In this study, we gave a short overview on the wide variety of applications of global opti- 
mization to distributed systems. For the last ten years, this has been one of the most active 
research areas in Evolutionary Computation, with many researchers steadily contributing 
new and enhanced approaches. 

We not only provided a representative list and classification of publications, but also 
introduced many interesting approaches in detail. Yet, we can only offer a small 
glimpse of the real amount of work available. The master's thesis of Kampstra [1087] is 
already four years old and referenced over 400 papers. From the related work sections of 
the papers that we have summarized, we know that there should exist at least another 200 
contributions not mentioned in his list or not yet published when it was compiled. 

Practitioners in the area of networking or telecommunication tend to feel skeptical when 
it comes to the utilization of such eerie things as randomized or bio-inspired approaches 
for optimizing, managing, or controlling their systems. One argument against the use of 
metaheuristics is that the worst-case results may be unpredictably bad although they may 
provide good solutions on average. 

Nevertheless, certain problems (like the Terminal Assignment Problem, see ??) are NP-hard 
and therefore can only be solved efficiently with such approaches. This, of course, goes 
hand in hand with a certain trade-off in terms of optimality, for instance. In static design 
scenarios, the worst-case situations in which an EA would create inferior solutions can be 
ruled out by checking its results before the actual deployment or realization. 

In practice, additional application-specific constraints are often imposed on standard 
problems. The influence of these constraints on the problem hardness and the applicability 
of the well-known solutions is not always easy to comprehend. Thus, incorporating the 
constraints into a global optimization procedure tends to be much easier than customizing a 
problem-specific heuristic algorithm. Assume that we want to find fast routes in a network 
which are also robust against a certain fraction of failed links. If we have an EA with an 
objective function that measures the time a message travels in a fully functional network, 
it is intuitively clear that we can extend this approach by simply applying this function 
to a couple of scenarios with randomly created link failures, too. Creating a corresponding 
extension of Dijkstra's algorithm, however, is less straightforward. 
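The routing example can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: a candidate solution is a fixed route given as a list of nodes, links are undirected with constant delays, and a scenario in which the route is broken is rated with an arbitrary penalty value. The names `travel_time`, `robust_objective`, and `penalty` are hypothetical and do not stem from the approaches surveyed above.

```python
import random

def travel_time(route, delays, failed=frozenset()):
    """Sum of the link delays along `route`; infinite if a needed link is down."""
    total = 0.0
    for a, b in zip(route, route[1:]):
        link = frozenset((a, b))          # undirected link (assumption)
        if link in failed or link not in delays:
            return float("inf")
        total += delays[link]
    return total

def robust_objective(route, delays, n_scenarios, fail_fraction, rng, penalty=1e3):
    """Average the original objective over the intact network and over
    several scenarios with randomly failed links."""
    links = list(delays)
    n_fail = int(fail_fraction * len(links))
    times = [travel_time(route, delays)]  # fully functional network
    for _ in range(n_scenarios):
        failed = frozenset(rng.sample(links, n_fail))
        t = travel_time(route, delays, failed)
        # penalty for a broken route is an illustrative choice
        times.append(penalty if t == float("inf") else t)
    return sum(times) / len(times)
```

The original objective function is reused unchanged; only the set of evaluated scenarios grows, which is the point made in the paragraph above.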

Nature-inspired approaches have not only shown their efficiency in static optimization 
problems, but have proven to be especially robust in dynamic applications, too. This is 
particularly interesting in the looming age of networks of larger scale. Wireless networks 
[1707, 829, 1440], sensor networks [1012] (and wireless sensor networks [326]), Smart Home 
networks [899, 897], ubiquitous computing [850, 1218], and more require self-organization, 
efficient routing, optimal parameter settings, and power management. We are sure that nature- 
and physics-inspired global optimization methods will provide viable answers to many of 
these questions, which will become more and more important in the near future. 

When condensing the essence of this summary down to a single sentence, "Evolutionary 
Computing in Telecommunications - A likely EC success story", the title of Kampstra's 
thesis, maybe fits best. However, we believe that the word "likely" is no longer required, since many 
of the methods developed have already reached engineering-level applicability. 

Research Applications 



Research applications differ from real-world applications by the fact that they have not yet 
reached the maturity to be applied in the mainstream of their respective area. They initiate 
a process of improvement and refinement, until we obtain solutions that are on par or at least 
comparable with those obtained by the traditional methodologies. Such a process can, for 
instance, be observed when following the progress in the area of Genetic Programming via 
the book series of Koza [1196, 1195, 1210, 1212]. On the other hand, research applications 
differ from toy problems because they are not intended to be used as sole demonstration 
examples or benchmarks but are first steps into a new field of application. 

The future of a research application is either to succeed and become a real-world application 
or to fail. In case of a failure, it may turn into a toy application where certain 
features of global optimization methods like evolutionary algorithms can be tested. 

24.1 Genetic Programming of Distributed Algorithms 
24.1.1 Introduction 

Distributed systems are one of the most vital components of our economy. While many 
internet technologies, protocols, and applications grew into maturity and have widely been 
researched, new forms of networks and distributed computing have emerged. Amongst them, 
we can find wireless networks [1707, 829, 1440], sensor networks [1012] (and wireless sensor 
networks [326]), Smart Home networks [899, 897], ubiquitous computing [850, 1218], and 
ideas like amorphous computing [5, 313]. These distributed systems introduce new require- 
ments like self-adaptation to change in the environment (nodes may enter and leave the 
networks frequently) or change the priority of others (such as energy consumption). 

It may be a bold statement to say that such new requirements ask for new programming 
paradigms, and the future will show whether it holds or not. Nobody will, however, dispute 
that developing applications for these new distributed systems is bound to become more 
complicated than for traditional networks. Hence, exploring the utility of new programming 
methodologies (and new representations for algorithms especially tailored to them) is a 
demand of the current situation. 

The design of a distributed algorithm is basically the transformation of a specification 
of the behavior of a network on the global scale to a program that must be executed locally 
on each of the nodes of the network in order to achieve this behavior. Up to now, no general 
method for automating this process, illustrated in Fig. 24.1.a, has been developed, and it is 
unlikely that this will change in the near future. 

The transformation of global system behavior to local rules is not specific 
to distributed algorithm design. As a matter of fact, a widely studied example is the swarming 
behavior found in nature [1723, 2033]. Such behaviors have evolved over millions of years. By 
allowing many individuals of a species to travel together in a configuration which has a 
good volume/surface ratio, they improve defense against predators, increase the chance of 
finding mating partners, enhance foraging success, improve hydro- or aerodynamics, and so on. 
Nature has evolved many efficient swarming behaviors, such as the shoaling of fish (depicted 
in Fig. 24.1.b), flocking of birds, herding of cows, and swarming of locusts. 

Fig. 24.1.a: Design of distributed algorithms. 

Fig. 24.1.b: Evolutionary behavior design. 

Figure 24.1: Global → Local behavior transformations. 

Evolutionary algorithms copy the evolutionary process itself for solving complex optimization 
problems [99, 821], and Genetic Programming is the family of EAs which can be 
used for deriving programs [1196]. Here we will utilize it for breeding distributed algorithms, 
in other words, for transforming descriptions of global behaviors to local algorithms. 

To this end, the global descriptions are encoded in objective functions, which rate "how 
close" the behavior of an evolved program x comes to the desired one. In order to approximate 
its quality, we execute x on nodes represented by virtual machines in a whole simulated 
network. As in reality, many of these VMs run asynchronously at approximately the same 
speed, which may differ from VM to VM and cannot be assumed to be constant either. For 
different problems, different topologies are simulated. 

We apply multi-objective Genetic Programming since it allows us to optimize the algo- 
rithms for different aspects during the evolution. While the functional objective functions 
perform the actual comparison of the observed behavior of the simulated network (running 
the evolved algorithms) with the desired global behavior, non-functional objective functions 
foster the economical use of resources, minimizing communication and program size, for 
instance. 

24.1.2 Evolving Proactive Aggregation Protocols 



In this section we discuss what proactive aggregation protocols are and how we can evolve 
them using a modified symbolic regression approach with Genetic Programming. 

Aggregation Protocols 

Definition 24.1 (Aggregate). In computer science, an aggregate function¹ $a : \mathbb{R}^m \mapsto \mathbb{R}$ 
computes a single result $a(x)$ from a set of input data $x$. This result represents some feature 
of the input, like its arithmetic mean. 

Other examples of aggregate functions are the variance and the number of points in 
the input data. In general, an aggregate² is a fusion of a (large) set of low-level data to one 
piece of high-level information. Aggregation operations in databases and knowledge bases 
[1284, 1595, 1152, 1309, 385, 551, 1904], be they local or distributed, for instance, have been 
an active research area in the past decades. Here, large datasets from different tables are 
combined to an aggregate by structured queries which need to be optimized for maximal 
performance. 

With the arising interest in peer-to-peer applications (see Section 30.2.2) and sensor 
networks (discussed in Section 30.2.2), a whole new type of aggregation came into existence 
in the form of aggregation protocols. These protocols are a key functional building block 
by providing the processes in such distributed systems with access to global information 
including network size, average load, mean uptime, location and description of hotspots, 
and so on [2099, 1048]. Robust and adaptive applications often require this local knowledge 
of such properties of the whole. If, for example, the average concentration of some toxin 
(which is aggregated from the measurements of multiple sensors in a chemical laboratory) 
exceeds a certain limit, an alarm should be triggered. 

In aggregation protocols, the data vector x is no longer locally available but its elements 
are spread all over the network. When computing the aggregate under these circumstances, 
we cannot just evaluate a. Instead, some form of data exchange must be performed by the 
nodes. This exchange can happen in two ways: either reactive or proactive. In a reactive 
aggregation protocol, one of the nodes in the network issues a query to all other nodes. Only 
this node receives the answer in form of the result (the aggregate) or the data needed to 
compute the result as illustrated in Fig. 24.2.a. A proactive aggregation protocol, as sketched 
in Fig. 24.2.b, on the other hand allows all nodes in the network to receive knowledge of 
the aggregate. This is achieved by repetitive data exchange amongst the nodes and iterative 
local refinement of the estimates of the wanted value. Notice that the trivial solution would 
be that all nodes send their information to all other nodes. Generally, this is avoided since 
it is not a viable approach and instead, the data is disseminated step by step as part of the 
estimates. 

Gossip-Based Aggregation 

Jelasity et al. [1048] propose a simple yet efficient type of proactive aggregation protocols 
[1123]. In their model, a network consists of many nodes in a dynamic topology where every 
node can potentially communicate with every other node. Communication errors may 
occur, but Byzantine faults do not. The basic assumption of the protocol is that each node in the 
network holds one numerical value x. This value represents some information about the node 
or its environment, for example, the current work load. The task of the protocol is to 
provide all nodes in the network with an up-to-date estimate of the aggregate function $a(x)$ 
of the vector of all values $x = (x_p, x_q, \ldots)^T$. 

The nodes hold local states s (possibly containing x) which they can exchange via 
communication. To do so, each node picks its communication partners with the 
getNeighbor() method. 

The skeleton of the gossip-based aggregation protocol is specified in Algorithm 24.1 and 
consists of an active and a passive part. Once in every $\delta$ time units, at a randomly 
picked time, the active thread of a node p selects a neighbor q. Both partners exchange their 
information and update their states with the update method: p calls update($s_p$, $s_q$) in its 
active thread and q calls update($s_q$, $s_p$) in the passive thread. The update method is defined 
according to the aggregate that we want to compute. 

Fig. 24.2.a: reactive aggregation 

Fig. 24.2.b: proactive aggregation 

Figure 24.2: The two basic forms of aggregation protocols. 

1 http://en.wikipedia.org/wiki/Aggregate_function [accessed 2007-07-03] 

2 http://en.wikipedia.org/wiki/Aggregate_data [accessed 2007-07-03] 



Algorithm 24.1: gossipBasedAggregation() 

Data: p: the node running the algorithm 
Data: s_p: the local state of the node p 
Data: s_q, s_r: states received as messages from the nodes q and r 
Data: q, p, r: neighboring nodes in the network 

begin 
    Subalgorithm activeThread 
    begin 
        while true do 
            do exactly once in every δ units at a randomly picked time: 
                q ← getNeighbor() 
                sendTo(q, s_p) 
                s_q ← receiveFrom(q) 
                s_p ← update(s_p, s_q) 
    end 

    Subalgorithm passiveThread 
    begin 
        while true do 
            s_r ← receiveAny() 
            sendTo(getSender(s_r), s_p) 
            s_p ← update(s_p, s_r) 
    end 
end 



Example - Distributed Average 

Assume that we have built a sensor network measuring the temperature as illustrated in 
Figure 24.3. Each of our sensor nodes is equipped with a little display visible to the public. 

Figure 24.3: An example sensor network measuring the temperature. 



The temperatures measured locally will fluctuate because of wind or light changes. Thus, 
the displays should not only show the temperature measured by the sensor node they are 
directly attached to, but also the average of all temperatures measured by all nodes. Then, 
the network needs to execute a distributed aggregation protocol in order to estimate that 
average. 

If we choose a gossip-based average protocol for this purpose, each node will hold a state variable 
which contains its local estimate of the mean. The update function, which receives 
the local approximation and the estimate of another node, returns the mean of its inputs. 

$$\mathrm{update}_{\mathrm{avg}}(s_p, s_q) = \frac{s_p + s_q}{2} \tag{24.1}$$ 

If two nodes p and q communicate with each other, the new values of $s_p$ and $s_q$ will be 
$s_p(t+1) = s_q(t+1) = 0.5 \cdot (s_p(t) + s_q(t))$. The sum, and thus also the mean, of both 
states remains constant. Their variance, however, becomes 0, and so the overall variance in 
the network gradually decreases. 
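The effect of the averaging update can be checked with a small simulation. The sketch below is an illustration only: it assumes a fully connected network, sequential (not concurrent) exchanges within a cycle, uniformly random partner selection in place of getNeighbor(), and arbitrary initial values. The helper names `gossip_round` and `variance` are hypothetical.

```python
import random

def update_avg(s_p, s_q):
    """Update rule of Equation 24.1: the mean of both estimates."""
    return (s_p + s_q) / 2.0

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def gossip_round(states, rng):
    """One cycle: every node p contacts one randomly chosen partner q != p."""
    n = len(states)
    for p in range(n):
        q = rng.choice([i for i in range(n) if i != p])  # stands in for getNeighbor()
        states[p] = states[q] = update_avg(states[p], states[q])

rng = random.Random(1)  # fixed seed, only for reproducibility
states = [rng.uniform(10.0, 30.0) for _ in range(16)]
total_before = sum(states)
history = [variance(states)]
for _ in range(20):
    gossip_round(states, rng)
    history.append(variance(states))
```

Each exchange leaves the sum of the two involved states unchanged, so `sum(states)` stays at `total_before` (up to rounding), while the entries of `history` never increase.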




Fig. 24.4.a: initial state 

Fig. 24.4.b: after step 1 

Fig. 24.4.c: after step 2 

Figure 24.4: An example of the gossip-based aggregation of the average. 



In order to visualize how that type of protocol works, let us assume that we have a 
network of four nodes with the initial values $x = (5, 6, 7, 8)^T$ as illustrated in Fig. 24.4.a. 
The arithmetic mean of its elements is 

$$\frac{5 + 6 + 7 + 8}{4} = \frac{13}{2} = 6.5 \tag{24.2}$$ 

The initial variance is 

$$\frac{(5 - 6.5)^2 + (6 - 6.5)^2 + (7 - 6.5)^2 + (8 - 6.5)^2}{4} = \frac{5}{4} = 1.25 \tag{24.3}$$ 

In the first step of the protocol, the nodes with the initial values 5 and 7 as well as the 
other two exchange data with each other and update their values to 6 and 7, respectively 
(see Fig. 24.4.b). Now the average of all estimates is still 

$$\frac{6 + 6 + 7 + 7}{4} = 6.5 \tag{24.4}$$ 

but the variance has been reduced to 

$$\frac{(6 - 6.5)^2 + (6 - 6.5)^2 + (7 - 6.5)^2 + (7 - 6.5)^2}{4} = \frac{1}{4} = 0.25 \tag{24.5}$$ 



After the second protocol step, outlined in Fig. 24.4.c, all nodes estimate the mean with the 
correct value 6.5 (and thus, the variance is 0). The distributed average protocol is only one 
example of gossip-based aggregation. Others are: 

1. Minimum and Maximum. The minimum and maximum of a value in the network 
can be computed by setting $\mathrm{update}_{\min}(s_p, s_q) = \min\{s_p, s_q\}$ and 
$\mathrm{update}_{\max}(s_p, s_q) = \max\{s_p, s_q\}$, respectively. 

2. Count. The number of nodes in a network N can be computed using the average 
protocol: the initiator sets its state to 1 and all other nodes begin with 0. The 
average computed is then $\frac{1}{\mathrm{numNodes}(N)}$, where $\mathrm{numNodes}(N)$ is the 
number of nodes in N. The nodes now just need to invert the computed value locally 
to obtain $\mathrm{numNodes}(N)$. 

3. Sum. The sum of all values in the network can be computed by estimating both the 
mean value $\bar{x}$ and the number of nodes $\mathrm{numNodes}(N)$ in the network N simultaneously 
and multiplying them with each other: $\mathrm{numNodes}(N) \cdot \bar{x} = \sum_i x_i$. 

4. Variance. As declared in Equation 28.61 on page 474, the variance of a data set is 
the difference between the mean of the squares of the values and the square of their mean. 
Therefore, if we compute $\overline{x^2}$ and $\bar{x}$ by using the average protocol, we can subtract them, 
$\mathrm{var}(x) \approx \overline{x^2} - \bar{x}^2$, and, hence, obtain an estimate of the variance. 
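The four derived aggregates can be tried out on the four-node example with the values 5, 6, 7, and 8. The following sketch is a simplification: exchanges follow a fixed, hand-written schedule instead of getNeighbor(). The pairs of the first cycle follow the text (5 with 7, 6 with 8); the pairs of the second cycle, the choice of node 0 as initiator of the count, and the helper name `run_schedule` are assumptions made for illustration.

```python
def update_avg(s_p, s_q):
    """Averaging update of Equation 24.1."""
    return (s_p + s_q) / 2.0

def run_schedule(states, schedule, update):
    """Apply `update` to both partners of every pair, cycle by cycle."""
    states = list(states)
    for pairs in schedule:
        for p, q in pairs:
            states[p] = states[q] = update(states[p], states[q])
    return states

x = [5.0, 6.0, 7.0, 8.0]
# Cycle 1: nodes holding 5 and 7, and nodes holding 6 and 8 (as in the text).
# Cycle 2: assumed pairs that complete the dissemination.
schedule = [[(0, 2), (1, 3)], [(0, 1), (2, 3)]]

mean = run_schedule(x, schedule, update_avg)
mean_sq = run_schedule([v * v for v in x], schedule, update_avg)
indicator = run_schedule([1.0, 0.0, 0.0, 0.0], schedule, update_avg)  # node 0 initiates
minimum = run_schedule(x, schedule, min)
maximum = run_schedule(x, schedule, max)

count = [1.0 / v for v in indicator]              # invert 1/numNodes(N) locally
total = [c * m for c, m in zip(count, mean)]      # numNodes(N) times the mean
var = [sq - m * m for sq, m in zip(mean_sq, mean)]
```

After the two cycles, every node holds the same estimates: a mean of 6.5, a count of 4, a sum of 26, and a variance of 1.25, which agrees with Equations 24.2 and 24.3.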

Further considerations are required if x is not constant but changes over time. Both peer-to-peer 
networks and sensor networks have properties (discussed in Section 30.2.2) 
which are very challenging for distributed applications and lead to an inherent volatility 
of x. According to Jelasity et al. [1048], a default approach to handle unstable data is 
to periodically restart the aggregation protocols. In our research, we were able to provide 
alternative aggregation protocols capable of dealing with dynamically changing data. This 
approach is discussed in Section 24.1.2 on page 414. 



The Solution Approach: Genetic Programming 

In order to derive certain aggregate functions automatically, we could modify the Genetic 
Programming approach for symbolic regression introduced in Section 23.1 on page 397 [2180, 
2187]. Let $a : \mathbb{R}^m \mapsto \mathbb{R}$ be the exact aggregate function. It works on a vector of dimension 
$m$ containing the data elements. $m \in \mathbb{N}$ is not a predetermined constant but depends on the 
network size, i.e., $m = \mathrm{numNodes}(N)$, and $a$ will return exact results for $m = 1, 2, 3, \ldots$. 
In Section 28.7.2 on page 503, we will show that the dimension $m$ of the domain $\mathbb{R}^m$ of $a$ 
plays no role when approximating it with a maximum likelihood estimator. The theorems 
used there are again applied in symbolic regression (see Equation 23.6 on page 399), so the 
value of $m$ does not affect the correctness of the approach. Deriving aggregation functions for 
distributed systems, however, exceeds the capabilities of normal symbolic regression. Each 
of the $m = \mathrm{numNodes}(N)$ nodes in the network N holds exactly one element of the data 
vector. $a$ cannot be computed directly anymore since it requires access to all data elements 
at once. Instead, each node has to execute local rules that define how data is exchanged 
and how an approximation of the aggregate value is calculated. How to find these rules 
automatically is the subject of our research here. There are three use cases for such an 
automated aggregation protocol generation: 

1. We may already know a valid aggregation protocol but want to find an equivalent proto- 
col which has advantages like faster convergence or robustness in terms of input volatility. 
This case is analogous to finding arithmetic identities in symbolic regression. 

2. We know neither the aggregate function $a$ nor the protocol but have a set of sample 
data vectors $x_i$ (maybe differing in dimensionality) and corresponding aggregates $y_i$. 
Using Genetic Programming, we attempt to find an aggregation protocol that fits 
this sample information. 

3. The most probable use case is that we know how to compute the aggregate locally 
with a given function $a$ but want to find a distributed protocol that does the same. 
We, for example, are well aware of how to compute the arithmetic mean of a data set 
$(x_1, x_2, \ldots, x_m)$: we just divide the sum of the single data items by their number $m$. 
If these items, however, are distributed and not locally available, we cannot simply sum 
them up. The correct solution described in Section 24.1.2 on page 416 is that each 
node starts by approximating the mean with its locally known value. Then, two 
nodes at a time inform each other about their estimates and set their new approximation to 
the mean of the old and the received one. This way, the aggregate is approached by 
iteratively refining the estimates. 

The transformation of the local aggregate calculation rule a to the distributed one is 
not obvious. Instead of doing it by hand, we can just use the local rule to create sample 
data sets and then apply the approach of the second use case. 
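Creating sample data from a known local rule could look as follows. This is a sketch under assumptions: the value range, the range of network sizes, and the helper name `make_samples` are chosen freely for illustration and are not taken from the experiments described here.

```python
import random

def make_samples(aggregate, n_samples, min_nodes, max_nodes, rng):
    """Create pairs (x_i, y_i) with y_i = a(x_i); the dimension of x_i
    (the simulated network size) varies from sample to sample."""
    samples = []
    for _ in range(n_samples):
        m = rng.randint(min_nodes, max_nodes)            # numNodes(N) of this sample
        x = [rng.uniform(0.0, 100.0) for _ in range(m)]  # one value per node
        samples.append((x, aggregate(x)))
    return samples

def arithmetic_mean(x):
    """The locally computable aggregate a used as the example rule."""
    return sum(x) / len(x)

samples = make_samples(arithmetic_mean, n_samples=10, min_nodes=2,
                       max_nodes=16, rng=random.Random(42))
```

An evolved protocol would then be rated by how closely the estimates of the simulated nodes approach each $y_i$ when the elements of $x_i$ are distributed over the network.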

Network Model and Simulation 

For gossip-based aggregation protocols, Jelasity et al. [1048] assume a topology where all 
nodes can potentially communicate with each other. In this fully connected overlay network, 
communication can be regarded as fault-free. Taking a look at the basic algorithm scheme 
of such protocols introduced as Algorithm 24.1 on page 416, we see that the data exchange 
happens once every $\delta$ time units at a randomly picked point in time. Even though it is 
asynchronous in reality, it will definitely happen within this time span. That is, we may simplify 
the model to a synchronous network model where all communication happens simultaneously. 

Another aspect of communication is how the nodes select their partners for the data 
exchange. It is a simple fact that the protocol can only converge to the correct value if each 
node has, maybe over multiple hops and calculations, been able to receive information from 
all other nodes. Imagine a network N consisting of $m = \mathrm{numNodes}(N) = 4$ nodes p, q, r, 
and t. If the communication partners are always (p,q) and (r, t), the data dissemination is 
insufficient since p will never be able to incorporate the knowledge of the states of r and t. 
On the other hand, one data exchange between q and r will allow the protocol to work since 
p would later on indirectly receive the required information from q. 
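The dissemination argument can be reproduced with the averaging update. The sketch below compares a schedule that always uses the pairs (p, q) and (r, t) with one that also lets q and r exchange data. The initial values and the recurring q-r exchange are assumptions made for illustration; the text only requires that information can flow between the two groups.

```python
def run(states, schedule):
    """Apply the averaging update to every pair, cycle by cycle."""
    states = list(states)
    for pairs in schedule:
        for a, b in pairs:
            states[a] = states[b] = (states[a] + states[b]) / 2.0
    return states

P, Q, R, T = 0, 1, 2, 3
x = [5.0, 6.0, 7.0, 8.0]            # global mean: 6.5

# Always the same two pairs: the two groups never learn about each other.
fixed = [[(P, Q), (R, T)]] * 10
stuck = run(x, fixed)               # p, q agree on 5.5 and r, t on 7.5

# Adding an exchange between q and r connects both groups.
mixed = [[(P, Q), (R, T)], [(Q, R)], [(P, Q), (R, T)]] * 20
better = run(x, mixed)              # all estimates approach 6.5
```

With the fixed pairs, the estimates remain at the group means no matter how many cycles are executed, whereas the mixed schedule halves the remaining difference between the groups every time q and r communicate.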

Besides this basic fact, Jelasity et al. [1048] have shown that different forms of pair selec- 
tion influence the convergence speed of the protocol. Correct protocols will always converge 
if complete data dissemination is guaranteed. Knowing that, we should choose a partner 
selection method that leads to fast convergence because we can then save protocol steps in 
the evaluation process. The pair building should be deterministic, because randomized selection 
schemes lead to slow convergence [1048] and, more importantly, produce different 
outcomes in each test, which makes comparing the different evolved protocols complicated (as 
discussed in Section 1.3.4 on page 55). Therefore, choosing a deterministic selection scheme 
seems to be the best approach. Perfect matching according to Jelasity et al. [1048] means 
that each node is present in exactly one pair per protocol cycle, i.e., always takes part in 
the data exchange. If different pairs are selected in each cycle, the convergence speed will 
increase. It can further be increased by selecting (different) pairs in a way that disseminates 
the data fastest. 

Figure 24.5: Optimal data dissemination strategies. 

From these ideas, we can derive a deterministic pair selection mechanism with best-case 
convergence. To this end, we first need to set the number of nodes in the simulated network 
to a power of two, $\mathrm{numNodes}(N) = m = 2^d$. In each protocol step $t$ with $t = 1, 2, \ldots$, we 
compute a value $\Delta = 2^{t \bmod d}$. Then we build pairs of the form $(i, i + \Delta)$, where $i$ and $i + \Delta$ 
are the indices identifying the nodes. This setup equals a butterfly graph and is optimal, as 
you can see in Fig. 24.5.a. The data from node 0 (marked with a thick border) spreads in 
the first step to node 1. In the second step, it reaches node 2 directly and node 3 indirectly 
through node 1. Remember, if the average protocol used this pair selection scheme, 
node 3 would compute its new estimate at step 2 as 

$$s_3(t=2) = \frac{s_3(t=1) + s_1(t=1)}{2} = \frac{\frac{s_3(t=0) + s_2(t=0)}{2} + \frac{s_0(t=0) + s_1(t=0)}{2}}{2} \tag{24.6}$$ 

In the third protocol step, the remaining four nodes receive knowledge of the information
from node 0 and the data is disseminated over the complete network. Now the cycle would
start over again and node 0 would communicate with node 1.

This pair selection method is bound to networks of the size $m = 2^d$. We can generalize
this approach by breaking up the strict pair-communication limitation. To do so, we set
$d = \lceil \log_2 m \rceil$ while still leaving $\Delta = 2^{t \bmod d}$ and define that a node i
sends its data to the node $(i + \Delta) \bmod m$ for all i, as illustrated in Fig. 24.5.b. This
general communication rule abandons the strict pair-based data exchange but leaves any other
feature of the aggregation protocols, like the working of the update method, untouched. We
should again keep in mind that this rule is only defined so we can construct simulations where
the protocols need as few steps as possible to converge to the correct value in order to spare
us computation time. Another important aspect also becomes obvious here: The time that an
aggregation protocol needs to converge will always depend on the number of nodes in the
(simulated) network.
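
The generalized communication rule can be made concrete with a small sketch. It assumes 0-based node indices and reads d as the rounded-up binary logarithm of m; the function names are hypothetical and not part of the original framework.

```python
def partner_offset(t: int, m: int) -> int:
    """Offset delta = 2^(t mod d), with d = ceil(log2(m)) (assumed rounding)."""
    d = max(1, (m - 1).bit_length())  # equals ceil(log2(m)) for m >= 2
    return 2 ** (t % d)


def receivers(t: int, m: int) -> list[int]:
    """Entry i is the node that node i sends to in step t: (i + delta) mod m."""
    delta = partner_offset(t, m)
    return [(i + delta) % m for i in range(m)]


def steps_until_informed(m: int, source: int = 0) -> int:
    """Count the protocol steps until data of `source` has reached all m nodes."""
    informed = {source}
    t = 0
    while len(informed) < m:
        t += 1
        delta = partner_offset(t, m)
        # every informed node forwards what it knows to its receiver
        informed |= {(i + delta) % m for i in informed}
    return t
```

For networks whose size is a power of two, the number of informed nodes doubles in every step of this sketch, so the data of one node reaches all m nodes after d steps.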

Node Model and Simulation 

As important as modeling the network is the model of the nodes it consists of. In Figure 24.6, 
we illustrate an abstraction especially suitable for fast simulation of aggregation protocols. 
A node p executing a gossip-based aggregation protocol receives input in form of the locally
known value (for example a sensor reading) and also in form of messages containing data
from other nodes in the network. The output of p consists of the local approximation of the
aggregate value and the information sent to its partners in the network. The computation
is done by a processor which updates the local state by executing the update function. The
local state $s_p$ of p can most generally be represented as a vector $s_p \in \mathbb{R}^n$ of the
dimension n, where n is the number of memory cells available on a node.

Figure 24.6: The model of a node capable of executing a proactive aggregation protocol.

Like Jelasity et al. [1048], we have until now considered the states to be scalars.
Generalizing them to vectors allows us to specify or evolve more complicated protocols. The state
vector contains the approximation of the aggregate value at the position $O$ with
$1 \le O \le n$. If the state only consists of a single number, the messages between two nodes
will always include this single number and hence, the complete state.

A state vector not only serves as a container for the aggregate, but also as memory
capable of accumulating information. It is probably unnecessary or unwanted to exchange
the complete state during the communication. Therefore, we specify an index list e containing
the indices of the elements to be sent and a list r with the indices of the elements that shall
receive the values of the incoming messages. For proper communication between the nodes,
the lengths of e and r must be equal and each index must occur at most once in e and also at
most once in r. Whenever a node p receives a message from node q, the following assignment
is performed, with $s_p[i]$ being the i-th component of the vector $s_p$:

$$s_p[r_j] \leftarrow s_q[e_j] \quad \forall j = 0 \ldots \mathrm{len}(r) - 1 \tag{24.7}$$
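
The assignment in Equation 24.7 can be sketched in a few lines. The function name and the use of plain Python lists with 0-based indices are assumptions made for illustration only.

```python
def receive(state_p: list[float], state_q: list[float],
            e: list[int], r: list[int]) -> None:
    """Apply s_p[r_j] <- s_q[e_j] for all j, modifying state_p in place."""
    if len(e) != len(r):
        raise ValueError("e and r must have equal length")
    if len(set(e)) != len(e) or len(set(r)) != len(r):
        raise ValueError("each index may occur at most once in e and in r")
    message = [state_q[index] for index in e]   # the values that q sends
    for target, value in zip(r, message):       # the cells where p stores them
        state_p[target] = value
```

Building the message first and storing it afterwards mirrors the fact that only the selected elements of the sender state travel over the network.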

In the original form of gossip-based aggregation protocols, the state is initialized with a
static input value which is stepwise refined to approximate the aggregate value [1048]. In our
model, this restriction is no longer required. We specify an index $I$ pointing at the element
of the state vector that will receive the input. This allows us to grow protocols for static
and for volatile input data. In the latter case, the inputs are refreshed in each protocol step.
A node p would then perform

$$s_p[I](t) \leftarrow \mathrm{getInput}(p, t) \tag{24.8}$$

The function getInput(p, t) returns the input value of node p at time step t. With this
definition, the state vectors $s_p$ become time-dependent, written as $s_p(t)$. Finally, update is
now designed as a map $\mathbb{R}^n \to \mathbb{R}^n$ that returns the new state vector.

$$s_p(t + 1) = \mathrm{update}(s_p(t)) \tag{24.9}$$

In the network simulation, we can put the state vectors of all nodes together to a single
$n \times m$ matrix $S(t)$. The column k of this matrix contains the state vector $s_k$ of the node k.

$$S(t) = (s_1, s_2, \ldots, s_m) \tag{24.10}$$

$$S_{j,k} = s_k[j] \tag{24.11}$$

This notation is used in Algorithm 24.2 and Algorithm 24.3. In Algorithm 24.2 we specify
how the model definitions just discussed can be used to build a network simulation for
gossip-based, proactive aggregation protocols. Here we also apply the general optimal
communication scheme explained in Section 24.1.2. In the practical realization, we can avoid
creating a new matrix S(t) in each time step t by initially allocating two matrices $S_1$, $S_2$
which we simply swap in each turn.
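
A minimal sketch of such a simulation with two swapped state buffers could look as follows. It is not the implementation of the original framework: the function name, the 0-based indices, the per-node list layout, and the rounded-up logarithm for d are assumptions made for this illustration.

```python
def simulate_network(m, T, n, update, get_input, e, r, I=0):
    """Simulate T protocol steps on m nodes with n memory cells each.

    `update` maps a state vector to the new state vector, `get_input(k, t)`
    returns the input of node k at step t, `e`/`r` are the send/receive
    index lists and `I` is the input index. Returns the final state vectors.
    """
    d = max(1, (m - 1).bit_length())                  # ceil(log2(m))
    old = [[get_input(k, 0)] * n for k in range(m)]   # all cells start with the input
    new = [row[:] for row in old]                     # second buffer, allocated once
    for t in range(1, T + 1):
        for k in range(m):                            # start from the previous states
            new[k][:] = old[k]
        delta = 2 ** (t % d)
        for k in range(m):                            # node k sends to node p
            p = (k + delta) % m
            for e_j, r_j in zip(e, r):
                new[p][r_j] = old[k][e_j]
        for k in range(m):                            # refresh the input, then update
            new[k][I] = get_input(k, t)
            new[k] = list(update(new[k]))
        old, new = new, old                           # swap instead of re-allocating
    return old
```

With a three-cell state (input, estimate, received value) and the update rule "estimate becomes the mean of estimate and received value", four nodes with the static inputs 1, 2, 3, 4 all hold the estimate 2.5 after two steps of this sketch.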



Evaluation and Objective Values 

The models described before are the basis of the evaluation of the aggregation protocols 
that we breed. In general, there are two functional features that we want to develop in the 
artificial evolution: 

1. We want to grow aggregation protocols where the deviation between the local estimates 
and the global aggregate is as small as possible, ideally 0. 

2. This deviation can surely not be 0 after the first iteration at t = 1, because the nodes
do not know all data at that time. However, the way in which received data is incorporated
into the local state of a node can very well influence the speed of convergence to the
wanted value. Therefore, we want to find protocols that converge as quickly as possible.

In all use cases discussed in Section 24.1.2, we either already know the correct aggregation
values $y_i$ or the local aggregate function $a : \mathbb{R}^m \to \mathbb{R}$ that calculates them from
data vectors of the length m. The objective is to find a distributed protocol that computes the
same aggregates in a network where the data vector is distributed over m nodes. In our model,
the estimates of the aggregate value can be found at the positions
$S[O, \cdot]^T = s_k[O] \;\forall k \in [1 \ldots m]$ in the state matrix or the state vectors,
respectively.

The deviation $\varepsilon(k, t)$ of the local approximation of a node k from the correct aggregate
value y(t) at a point in time t denotes its estimation error.

$$y(t) = a\left((\mathrm{getInput}(1, t), \ldots, \mathrm{getInput}(m, t))^T\right) \tag{24.12}$$

$$\varepsilon(k, t) = y(t) - S[O, k](t) = y(t) - s_k[O] \tag{24.13}$$

Algorithm 24.2: simulateNetwork(m, T)

Input: m: the number of nodes in the simulation
Input: T: the maximum number of simulation steps
Input: [implicit] update: the update function
Input: [implicit] I: the index for the input values
Input: [implicit] O: the index for the output values
Input: [implicit] e: the index list for the send values
Input: [implicit] r: the index list for the receive values
Data: d: communication step base according to Fig. 24.5.b on page 420
Data: k: a node index
Data: S(t): the simulation state matrix at time step t
Data: Δ: the communication partner offset
Data: p: the communication partner node

begin
    d ← ⌈log2 m⌉
    S(0) ← createMatrix(n × m)
    // initialize with local values
    S(0)[·, k] ← getInput(k, 0)
    t ← 1
    while t ≤ T do
        S(t) ← copyMatrix(S(t − 1))
        Δ ← 2^(t mod d)
        // perform communication according to Fig. 24.5.b on page 420
        k ← 1
        while k ≤ m do
            p ← (k + Δ) mod m
            foreach j ∈ [1..len(r)] do S(t)[r_j, p] ← S(t − 1)[e_j, k]
            k ← k + 1
        // set (possible) new input values and perform update
        k ← 1
        while k ≤ m do
            S(t)[I, k] ← getInput(k, t)
            S(t)[·, k] ← update(S(t)[·, k])
            // ≡ s_k(t) ← update(s_k(t))
            k ← k + 1
        t ← t + 1
end

We have already argued that the mean square error is an appropriate quality function for
symbolic regression (see Equation 23.6). Analogously, the mean of the squares of the errors
$\varepsilon$ over all simulated time steps and all simulated nodes is a good criterion for the utility
of an aggregation protocol. It is even related to both functional aspects subject to optimization:
The larger it is, the greater is the deviation of the estimates from the correct value. If the
convergence speed of the protocol is low, these deviations will shrink more slowly over time.
Hence, the mean square error will also be higher. For any evolved update function
u we define:

$$f_1(u, e, r) = \frac{1}{T \cdot m} \sum_{t=1}^{T} \sum_{k=1}^{m} \varepsilon(k, t)^2 \,\Big|_{u,e,r} \tag{24.14}$$

where $\cdot\,|_{u,e,r}$ means "passing u, e, r as input to Algorithm 24.3".

This rather mathematical definition is realized indirectly in Algorithm 24.3, which returns
the value of $f_1$ for an evolved update method u. It also applies the fast, convergence-friendly
communication scheme discussed in Section 24.1.2. Its realization in the Distributed Genetic
Programming Framework [2177] software allows us to evaluate even complex distributed
protocols in very short time: A protocol can be tested on 16 nodes for 300 protocol steps
in less than 5 milliseconds on a normal, 3 GHz off-the-shelf PC.

Algorithm 24.3: f1(u, e, r) ← evaluateAggregationProtocol(u, m, T)

Input: u: the evolved protocol update function to be evaluated
Input: m: the number of nodes in the simulation
Input: T: the maximum number of simulation steps
Input: [implicit] update: the update function
Input: [implicit] I: the index for the input values
Input: [implicit] O: the index for the output values
Input: [implicit] e: the index list for the send values
Input: [implicit] r: the index list for the receive values
Data: d: communication step base according to Fig. 24.5.b on page 420
Data: k: a node index
Data: S(t): the simulation state matrix at time step t
Data: Δ: the communication partner offset
Data: p: the communication partner node
Data: res: the variable accumulating the square errors
Output: f1(u, e, r): the sum of all square errors (deviations from the correct aggregate) over
        all time steps

begin
    d ← ⌈log2 m⌉
    S(0) ← createMatrix(n × m)
    // initialize with local values
    S(0)[·, k] ← getInput(k, 0)
    res ← 0
    t ← 1
    while t ≤ T do
        S(t) ← copyMatrix(S(t − 1))
        // perform communication according to Fig. 24.5.b on page 420
        // set (possible) new input values and perform update
        k ← 1
        while k ≤ m do
            S(t)[I, k] ← getInput(k, t)
            // u is the evolved update function and thus used here
            S(t)[·, k] ← u(S(t)[·, k])
            res ← res + (y(t) − S(t)[O, k])²
            // ≡ res ← res + (a(x(t)) − S(t)[O, k])²
            k ← k + 1
        t ← t + 1
    return res
end
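
The evaluation loop can be sketched in runnable form as follows. This is only an illustration under stated assumptions: indices are 0-based, d is the rounded-up binary logarithm of m, the helper names are hypothetical, and dividing the returned sum by T·m follows the reading of $f_1$ as a mean over all steps and nodes.

```python
def evaluate_protocol(u, m, T, n, get_input, aggregate, e, r, I=0, O=1):
    """Return the sum of squared estimation errors over all steps and nodes."""
    d = max(1, (m - 1).bit_length())                   # ceil(log2(m))
    old = [[get_input(k, 0)] * n for k in range(m)]    # all cells start with the input
    res = 0.0
    for t in range(1, T + 1):
        new = [row[:] for row in old]
        delta = 2 ** (t % d)
        for k in range(m):                             # communication step
            p = (k + delta) % m
            for e_j, r_j in zip(e, r):
                new[p][r_j] = old[k][e_j]
        y = aggregate([get_input(k, t) for k in range(m)])  # correct aggregate y(t)
        for k in range(m):
            new[k][I] = get_input(k, t)
            new[k] = list(u(new[k]))                   # evolved update function
            res += (y - new[k][O]) ** 2
        old = new
    return res


def mean_square_error(res, m, T):
    """Normalize the accumulated error sum to a mean over T steps and m nodes."""
    return res / (T * m)
```

In this sketch, a protocol that never changes its estimate accumulates the same error in every step, whereas a converging protocol only contributes errors during its first steps, which is why the measure also rewards fast convergence.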



Input Data 

In Algorithm 24.3 we use sample values of the aggregate $a$ in order to determine the errors
$\varepsilon$. In two of our initial use cases, we need to create these values before the evaluation
process, either with an existing protocol or with a known aggregate function $a$. Here we will
focus on the latter case.

When transforming a local aggregate function $a$ into a distributed aggregation protocol, we
need to create sample data vectors for the getInput(k, t) method. Here we can differentiate
between static and dynamic input data: for static input data, we just need to create the
samples for t = 0 since $\mathrm{getInput}(k, 0) = \mathrm{getInput}(k, 1) = \ldots = \mathrm{getInput}(k, T) \;\forall k \in [1..m]$.
If we have dynamic inputs on the other hand, we need to ensure that at least some elements of
the input vectors $x(t) = \mathrm{getInput}(\cdot, t)$ will differ, i.e., $\exists\, t_1, t_2 : x(t_1) \ne x(t_2)$.
If this difference is too large, an aggregation protocol cannot converge. It should be noted that
it would be wrong to assume that we can measure this difference in terms of the sample data x:
restrictions like $0.9 < \frac{x[i](t)}{x[i](t+1)} < 1.1$ are useless, because their impact on the
value of $a$ is unknown. Instead, we must limit the variations in terms of the aggregation
results, like

$$0.9 < \frac{a(x(t))}{a(x(t+1))} < 1.1 \tag{24.15}$$
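
A check of this kind can be sketched as below. The function name and the default bounds are illustrative assumptions; the ratio is taken between consecutive time steps.

```python
def aggregate_variation_ok(samples, aggregate, low=0.9, high=1.1):
    """Check low < a(x(t)) / a(x(t+1)) < high for all consecutive time steps.

    `samples` is a list of input vectors x(0), x(1), ..., one per time step.
    """
    values = [aggregate(x) for x in samples]
    for a_now, a_next in zip(values, values[1:]):
        if a_next == 0:
            return False          # ratio undefined, treat as a violation
        if not (low < a_now / a_next < high):
            return False
    return True
```

For the maximum as aggregate, a single input may change by a factor of five without affecting the aggregate at all, which illustrates why bounds on the raw inputs say little about the variation of a.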



In both the static and the dynamic case, we need to create multiple input datasets,
distinguished by adding a dataset index to x(t): x(1, t), x(2, t), ..., x(z, t). Only if
$a(x(1, t)) \ne a(x(2, t)) \ne \ldots \ne a(x(z, t))$ can we ensure that the results are not just
overfitted protocols that simply always return one single learned value: the exact $a$ of the
sample data. The individual x(i, t) should differ in magnitude, sign, and distribution since
this will lead to large differences in the a-values:

$$\frac{a(x(i, t))}{a(x(j, t))} \ll 1 \;\lor\; \frac{a(x(i, t))}{a(x(j, t))} \gg 1 \quad \forall i \ne j \tag{24.16}$$



We use z such data sets to perform z runs of Algorithm 24.3 and compute the true value
of $f_1$ as the arithmetic mean of the single results.

$$f_1(u, e, r) = \frac{1}{z} \sum_{i=1}^{z} f_1(u, e, r)\big|_{x(i)} \tag{24.17}$$



Of course, for each protocol that we evaluate we will use the same sample data sets
because otherwise the results would not be comparable. It should be noted that overfitting
can furthermore be prevented by changing the sample vectors in each generation. In this
experiment series, generating the test data was too time consuming so we did not apply this
measure.
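
Averaging the objective over several fixed data sets, together with a check that the sample aggregates actually differ, might be sketched like this. Both helper names are hypothetical, and `evaluate` stands for any function that runs one evaluation on a single data set.

```python
def averaged_objective(evaluate, datasets):
    """Arithmetic mean of the objective values obtained on each data set."""
    results = [evaluate(x) for x in datasets]
    return sum(results) / len(results)


def aggregates_distinct(datasets, aggregate):
    """True if no two data sets share the same aggregate value."""
    values = [aggregate(x) for x in datasets]
    return len(set(values)) == len(values)
```

Reusing the same `datasets` list for every protocol keeps the averaged objective values comparable between individuals.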



Volatile Input Data 

The specification of getInput(k, t), which returns the input value of node k at time
$t \in [0..T]$, allows us to evolve aggregation protocols for static as well as for volatile input.
Traditional aggregation protocols are only able to deal with constant inputs [1048]. These
protocols have good convergence properties, as illustrated in Fig. 24.7.a. They always converge
to the correct results but will simply ignore changes in the input data (see Fig. 24.7.b).

They would need to be restarted in a real application from time to time in order to provide 
up-to-date approximations of the aggregate. This approach is good if the input values in the 
real application that we evolve the protocols for change slowly. If these inputs are volatile, the 
estimations of these protocols become more and more imprecise. The fact that an aggregation 
protocol needs a certain number of cycles to converge is an issue especially in larger or mobile 
sensor networks. One way to solve this problem is to increase the data rate of the network 
accordingly and to restart the protocols more often. If this is not feasible because, for
example, energy restrictions in a low-power sensor network application prohibit increasing
the network traffic, dynamic aggregation protocols may help.

They represent a sliding average of the approximated parameter and are able to cope with 
changing input data. In each protocol step, they will incorporate their old state, the received 
information, and the current input data into the calculations. A dynamic distributed average 
protocol like the one illustrated in Figure 24.8 is a weighted sum of the old estimate, the 
received estimate, and the current value. The weights in the sum can be determined by the 
Genetic Programming process according to the speed with which the inputs change. In order 
to determine this speed for the simulations, a few real sample measurements would suffice 
to produce customized protocols for each application situation. However, the incorporation 
of the current input value is also the drawback of such an approach, since it cannot fully 
converge to the correct result anymore. 
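
The weighted-sum idea can be written down as a short sketch. The weights below are purely illustrative placeholders, not values from the experiments; in the approach described here they would be determined by the Genetic Programming process.

```python
import math


def dynamic_average_update(old_estimate, received_estimate, current_input,
                           w_old=0.4, w_recv=0.4, w_input=0.2):
    """New estimate as weighted sum of old estimate, received estimate and input.

    The default weights are hypothetical; they only need to sum to 1 so that
    equal arguments reproduce the same value.
    """
    if not math.isclose(w_old + w_recv + w_input, 1.0):
        raise ValueError("weights should sum to 1")
    return (w_old * old_estimate
            + w_recv * received_estimate
            + w_input * current_input)
```

A larger input weight lets the estimate follow changing inputs more quickly, while a smaller one keeps it closer to the network-wide estimate, which is the trade-off mentioned above.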

Figure 24.7: The behavior of the distributed average protocol in different scenarios.

Figure 24.8: A dynamic aggregation protocol for the distributed average.



Phenotypic Representations of Aggregation Protocols 

We have to find a proper representation for gossip-based aggregation protocols. Such a
protocol consists of two parts: the evolved update function and a specification of the properties
of the state vector - the variables I, O, r, and e.

Representation for the update Function 

The function update as defined in the context of our basic model for aggregation protocols
receives the state vectors $s_k(t) \in \mathbb{R}^n$ of time step t as input. It returns the new state
vectors $s_k(t + 1) \in \mathbb{R}^n$ of time step t + 1. This function is indeed an algorithm by itself
which can be represented as a list of tuples $l = (\ldots, (u_j, v_j), \ldots)$ of mathematical
expressions $u_j$ and vector element indices $v_j$. This list $l$ is processed sequentially for
$j = 1, 2, \ldots, \mathrm{len}(l)$. In each step j, the result of the expression $u_j$ is computed and
assigned to the $v_j$-th element of the old state vector s(t − 1). In the simplest case, $l$ will
have the length len(l) = 1. One example for this is the well-known distributed average protocol
illustrated in Figure 24.9: In the single formula, the first element of $s_k(t)$, [1], is assigned
0.5 * ([1] + [2]), which is the average of its old value and the received information. Here, the
value of the first element is sent to the partner and the received message is stored in the
second element, i.e., r = [2], e = [1]. The terminal set of the expressions now does not contain
the simple variable x anymore but all elements of the state vectors. Finally, after all formulas
in the list have been computed and their return values have been assigned to the corresponding
memory cells, the modified old state vector $s_k(t)$ becomes the new one $s_k(t + 1)$. Fig. 24.9.b
shows a more complicated protocol where update consists of len(l) = 4 formulas
$((u_1, 1), (u_2, 2), (u_3, 3), (u_4, 2))$. We will not elaborate deeper on these examples but just
note that both are valid results of Genetic Programming - a more elaborate discussion of them
can be found in Section 24.1.2 on page 430, Section 24.1.2 on page 431, and [2187, 2180].

Figure 24.9: Some examples for the formula series part of aggregation protocols.



The important point is that we are able to provide a notation for the first part of the
aggregation protocol specification that is compatible with normal symbolic regression and
which thus can be evolved using standard operators.
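
The sequential processing of such a formula list can be sketched as follows. Representing the expressions as Python callables is an assumption made for brevity; in the evolved protocols they are expression trees. Target indices are 1-based as in the text.

```python
def apply_update(state, formulas):
    """Process a list of (expression, target_index) tuples in order.

    Each expression is a callable that receives the current (already partly
    modified) state vector; its result is stored at the 1-based target index.
    """
    s = list(state)
    for expression, target in formulas:
        s[target - 1] = expression(s)   # later formulas see earlier assignments
    return s


# The distributed average protocol: [1] <- 0.5 * ([1] + [2]), with e = [1], r = [2]
average_protocol = [(lambda s: 0.5 * (s[0] + s[1]), 1)]
```

Because the list is processed in order, a later formula may read a cell that an earlier formula of the same cycle has just written.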

Besides this sequence of formulas computed repetitively in a cycle, we also need an 
additional sequence that is executed only once, in the initialization phase. It is needed for 
some other protocols than the distributed minimum, maximum, and average, which cannot 
assume the approximation of the estimate to be the current input value. Here, another 
sequence of instructions is needed which transforms the input value into an estimate which 
can then be exchanged with other nodes and used as basis for subsequent calculations. This
additional sequence is evolved and treated exactly in the same way as the set of formulas 
used inside the protocol cycle. 

Experiments have shown that it is useful though to initialize all state elements in the
first time step with the input values. Therefore, both Algorithm 24.2 and Algorithm 24.3
initially perform $S(0)[\cdot, k] \leftarrow \mathrm{getInput}(k, 0)$ instead of
$S(0)[I, k] \leftarrow \mathrm{getInput}(k, 0)$. In all other time steps, only $S(t)[I, k]$ is updated.

Straightforwardly, we can specify a non-functional objective function $f_2$ that returns the
number of expressions in both sets and, hence, puts pressure into the direction of small
protocols with lower computational costs.

Representation for I, O, e, and r

Like the update function, the parameters of the data exchange, r and e, become subject to
evolution. I and O are only single indices; we can assume them to be fixed as I = 1 and
O = 2. Allowing them to be changed will only result in populations of many incompatible
protocols. Although we could do the same with e and r, there is a very good reason to make
them variable. If e and r are built during the evolutionary process, different protocols with
different message lengths ($\mathrm{len}(e_1) \ne \mathrm{len}(e_2)$) can emerge. Hence, we can introduce a
non-functional objective function $f_3$ that puts pressure into the direction of minimal message
lengths. The results of Genetic Programming will thus be optimal not only in the accuracy of
the results but also in terms of communication costs.

For the lists e and r there are three possible representations. First, we can use a bit string
of the fixed length 2n which contains two bits for each element of s: the first bit determines
if the value of the element should be sent, the second bit denotes if an incoming element
should be stored there. String genomes of a fixed length are explained in detail in Section 3.4
on page 147. By doing so, we implicitly define some restrictions on the message structure
since we need to define an order on the elements inside. If n = 4, a bit string 01011010
will be translated into e = (3, 4) and r = (1, 2). It is not possible to obtain something like
e = (3, 4) and r = (2, 1).
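
The translation of the fixed-length bit string into the two index lists can be sketched like this. The function name is hypothetical; indices are 1-based to match the example above, and the sketch does not handle bit strings that select a different number of send and receive cells.

```python
def decode_message_pattern(bits: str):
    """Decode a bit string of length 2n into the index tuples (e, r).

    For element i (1-based), the first bit of its pair says whether its value
    is sent, the second bit whether an incoming value is stored there.
    """
    if len(bits) % 2 != 0 or set(bits) - {"0", "1"}:
        raise ValueError("expected an even-length string of 0s and 1s")
    e, r = [], []
    for i in range(len(bits) // 2):
        if bits[2 * i] == "1":
            e.append(i + 1)
        if bits[2 * i + 1] == "1":
            r.append(i + 1)
    return tuple(e), tuple(r)
```

Since both tuples are filled in ascending element order, an ordering such as r = (2, 1) cannot be produced, which is the restriction noted in the text.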

The second encoding scheme is to use two variable-length integer strings which represent
e and r directly. Such genomes are introduced in Section 3.5 on page 149. Now the latter
case becomes possible. If the lengths of the two strings differ, for example for reproduction
reasons, only the length of the shorter one is used.

The third approach would be to, again, evolve one single string z. This string is composed
of pairs $z = ((e_1, r_1), (e_2, r_2), \ldots, (e_j, r_j))$. The second and the third approach are
somewhat equivalent.

In principle, all three methods are valid and correct since the impossibility of some 
message structures in the first method does not necessarily imply that certain protocol 
functionality cannot evolve. The standard reproduction operators for string genomes, be 
they of fixed or variable length, can be applied. 

When we closely examine our abstract protocol representation, we will see that it will 
work with epidemic [1047] or SPIN-based [914] communication too, although we developed 
it for a gossip-based communication model. 

Reproduction Operators 

As already pointed out when elaborating on the representation schemes for the two parts of 
the aggregation protocols, well-known reproduction operators can be reused here. 

1. The formulas in the protocol strictly obey a tree form, where the root always has two
child nodes, the formula sequences for the protocol cycle and the initialization, which, in
turn, may have arbitrarily many children: the formulas themselves. A formula is a tree
node which stores one number identifying the vector element its result will be
written to. It has exactly one child node, the mathematical expression, which is a tree of
other expressions. We elaborate on tree-shaped genomes in Section 4.3 on page 162.

2. The communication behavior is described as either one fixed-length bit string or two
variable-length integer strings.

New protocols are created by first building a new formula tree and then combining it with 
one (or two, according to the applied coding scheme) newly created string chromosomes. We 
define the mutation operation as follows: If an aggregation protocol is mutated, with 80% 
probability its formula tree is modified and with 20% probability its message pattern. When 
performing a recombination operation, a new protocol is constructed by recombining the 
formula tree as well as the message definition of both parents with the default means. 
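
The mutation rule can be sketched as a small dispatcher. The representation of a protocol as a (formula tree, message pattern) tuple and the two operator arguments are assumptions for illustration; the actual tree and string mutation operators are the standard ones referenced above.

```python
import random


def mutate_protocol(protocol, mutate_tree, mutate_pattern,
                    rng=random, p_tree=0.8):
    """Mutate either the formula tree (80%) or the message pattern (20%).

    `protocol` is a (formula_tree, message_pattern) tuple; `mutate_tree` and
    `mutate_pattern` are the standard operators for the two genome parts.
    """
    tree, pattern = protocol
    if rng.random() < p_tree:
        return (mutate_tree(tree), pattern)
    return (tree, mutate_pattern(pattern))
```

Passing the random source as a parameter keeps the operator reproducible in tests, which fits the deterministic evaluation advocated earlier.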

Results from Experiments 



In Table 24.1, we list the configuration of the Genetic Programming algorithm applied to 
breed aggregation protocols. 



Parameter                     Short  Description
----------------------------  -----  ------------------------------------------------------------
Problem Space                 X      The space of aggregation programs (see Section 24.1.2)
Objective Functions           F      F = {f1, f2, f3}, see Algorithm 24.3, Section 24.1.2,
                                     Section 24.1.2
Search Space                  G      (basically) identical with the problem space, i.e., G = X
Search Operations             Op     mutation and crossover (see Section 24.1.2)
GPM                           gpm    not needed
Optimization Algorithm        alg    elitist Genetic Programming
Comparison Operator           cm     cmp_F,agg (see Equation 24.18)
Population Size               ps     ps = 4096
Maximum Archive Size          as     The size of the archive with the best known individuals
                                     was limited to as = 64 (see Definition 2.4)
Steady-State                  ss     The algorithms were generational (not steady-state),
                                     ss = 0 (see Section 2.1.6)
Fitness Assignment Algorithm  fa     For fitness assignment in the evolutionary algorithm,
                                     Pareto ranking was used (see Section 2.3.3)
Selection Algorithm           sel    A binary (k = 2) tournament selection was applied
                                     (see Section 2.4.4)
Convergence Prevention        cp     No additional means for convergence prevention were used,
                                     i.e., cp = 0 (see Section 2.4.8)
Number of Training Cases      tc     The number of training cases used for evaluating the
                                     objective functions was tc = 22, where each run is granted
                                     28 cycles in the static and 300 cycles in the dynamic case
ct                            ct     The training cases were fixed, i.e., ct = 0

Table 24.1: The settings of the Aggregation-Genetic Programming experiments.

In the simulations, 16 virtual machines were running, each holding a state vector s with
five elements. Of all experiments, those with a tiered prevalence comparison performed
best and, hence, will be discussed in this section. Tiered prevalence comparison is similar to a
Pareto optimization which is performed level-wise. When comparing two solution candidates
$x_1$ and $x_2$, initially only the objective values of the first objective function $f_1$ are
considered. If one of the solution candidates has a better value here than the other, it wins.
If both values are equal, we compare the second objective values in the same way, and so on. The
comparator function defined in Equation 24.18 gives correctness ($f_1$) precedence over protocol
size ($f_2$). Its result indicates which of the two individuals won: a negative number denotes
the victory of $x_1$, a positive one that $x_2$ is better. The tiered structure of $cmp_{F,agg}$
leads to optimal sets with few members that most often (but not always) have equal objective
values and only differ in their phenotypes.

$$
cmp_{F,agg}(x_1, x_2) =
\begin{cases}
-1 & \text{if } (f_1(x_1) < f_1(x_2)) \;\lor \\
   & \;\;((f_1(x_1) = f_1(x_2)) \land (f_2(x_1) < f_2(x_2))) \;\lor \\
   & \;\;((f_1(x_1) = f_1(x_2)) \land (f_2(x_1) = f_2(x_2)) \land (f_3(x_1) < f_3(x_2))) \\
1  & \text{if } (f_1(x_2) < f_1(x_1)) \;\lor \\
   & \;\;((f_1(x_2) = f_1(x_1)) \land (f_2(x_2) < f_2(x_1))) \;\lor \\
   & \;\;((f_1(x_2) = f_1(x_1)) \land (f_2(x_2) = f_2(x_1)) \land (f_3(x_2) < f_3(x_1))) \\
0  & \text{otherwise}
\end{cases}
\tag{24.18}
$$

We do not need more than five memory cells in our experiments. The message size was
normally one or two in all test series and if it was larger, it converged quickly to a minimum.
The objective function $f_3$ that minimizes it thus shows no interesting behavior. It can be
assumed that it will have characteristics similar to those of $f_2$ in larger problems.

Average - static

With this configuration, protocols for simple aggregates like minimum, maximum, and
average can be obtained in just a few generation steps. We have used the distributed
average protocol which computes $a_{avg} = \overline{x}$ in many of the previous examples, for instance,
in Section 24.1.2 on page 416, Section 24.1.2 on page 425, and in Fig. 24.9.a. The evolution
of a static version of such an algorithm is illustrated in Figure 24.10. The graphic shows how
the objective values of the first objective function (the mean square error sum) improve
with the generations in twelve independent runs of the evolutionary algorithm. All runs did
converge to the optimal solution previously discussed, most of them very quickly in less than
50 generations.

Figure 24.10: The evolutionary progress of the static average protocol.

Figure 24.11 reveals the inter-relation between the first and second objective function for
two randomly picked runs. Most often, when the accuracy of the (best known) protocols
increases, the number of formula expressions also rises. These peaks in $f_2$ are always followed
by a recession caused by stepwise improvement of the protocol efficiency by eliminating
unnecessary expressions. This phenomenon is rooted in the tiered comparison that we chose:
A larger but more precise protocol will always beat a smaller, less accurate one. If two
protocols have equal precision, the smaller one will prevail.
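
The tiered comparison behind this effect can be sketched as a lexicographic comparison of the objective vectors, all of which are minimized. The function name and the tuple representation are assumptions for illustration.

```python
def cmp_tiered(objectives_x1, objectives_x2):
    """Compare two candidates by their objective tuples (f1, f2, f3, ...).

    Returns -1 if the first candidate wins, 1 if the second wins, 0 if all
    objective values are equal. Earlier objectives take strict precedence.
    """
    for a, b in zip(objectives_x1, objectives_x2):
        if a < b:
            return -1
        if b < a:
            return 1
    return 0
```

A protocol with a lower error thus wins regardless of its size; size and message length only matter among protocols of equal error.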

Root-Of-Average - static

In our past research, we used the evolution of the root-of-average protocol as benchmark
problem [2180]. Here, a distributed average protocol for the aggregate function $a_{ra}$ is to be
evolved:

$$a_{ra}(x) = \sqrt{\overline{x}} \tag{24.19}$$

One result of these experiments has already been sketched in Fig. 24.9.b. Figure 24.12
is a plot of eleven independent evolution runs. It also shows a solution found after only
84 generations in the quickest experiment. The values of the first objective function $f_1$,
denoting the mean square error, improve so quickly in all runs at the beginning that a
logarithmic scale is needed to display them properly. This contrasts with the simple average
protocol evolution where the measured fitness is approximately proportional to the number
of generations. The reason is the underlying aggregate function which is more complicated
and thus, harder to approximate. Therefore, the initial errors are much higher and even
small changes in the protocols can lead to large gains in accuracy.





Figure 24.12: The evolutionary progress and one grown solution of the static root-of-average
protocol.

The example solution contains a useless initialization sequence. In the experiments, it
paradoxically did not vanish during the later course of the evolution although the secondary
(non-functional) objective function $f_2$ puts pressure into the direction of smaller protocols.
For the inter-relation between the first and second objective function, the same observations
can be made as for the average protocol. Improvements in $f_1$ often cause an increase in
$f_2$ which is followed by an almost immediate decrease, as pictured in Figure 24.13 for the
84-generation solution.

Figure 24.13: The relation of $f_1$ and $f_2$ in the static root-of-average protocol.



Average - dynamic 

After verifying our approach for conventional aggregation protocols with static input data,
it is time to test it with dynamically changing inputs. This may turn out to be a useful
application and is more interesting, since creating protocols for this scenario by hand is more
complicated.

So we first repeat the "average" experiment for two different scenarios with volatile input 
data. The first one is depicted with solid lines in Figure 24.14. Here, the true values of the 
aggregate a(x(i)) can vary in each protocol step by 1% and in one simulation by 50% in 
total. In the second scenario, denoted by dashed lines, these volatility measures are increased 
to 3% and 70% respectively. The different settings have a clear impact on the results of the 
error functions - the more unsteady the protocol inputs, the higher f_1 will be, as Figure 24.14 
clearly illustrates. The evolved solution exhibits very simple behavior: In each protocol step, 
a node first computes the average of its currently known value and the new sensor input. 
Then, it sets the new estimate to the average of this value and the value received from its 
partner node. Each node sends its current sensor value. This robust basic scheme seems to 
work fine in a volatile environment. The course of the evolutionary process itself has not 
changed significantly. The interactions between f_1 and f_2 also stay the same, as shown in 
Figure 24.15. 
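
The update rule of the evolved dynamic average protocol described above can be written down in a few lines. The following is only a minimal sketch of that rule as stated in the text; the class name AverageNode and the method names are hypothetical and do not stem from the evolved program or from the simulator used in the experiments.

```java
/** Sketch of one node running the evolved "average" protocol (names are hypothetical). */
final class AverageNode {
  /** The node's current estimate of the global average. */
  private double estimate;

  AverageNode(double initialEstimate) {
    this.estimate = initialEstimate;
  }

  /** The message a node sends in a step: simply its current sensor value. */
  static double message(double sensorValue) {
    return sensorValue;
  }

  /**
   * One protocol step: first average the known estimate with the new sensor
   * input, then average that result with the value received from the partner.
   */
  double step(double sensorValue, double receivedFromPartner) {
    double local = 0.5 * (this.estimate + sensorValue);
    this.estimate = 0.5 * (local + receivedFromPartner);
    return this.estimate;
  }

  double estimate() {
    return this.estimate;
  }
}
```

A node that starts with the estimate 10, reads the sensor value 20, and receives 30 from its partner first computes 15 locally and then stores 22.5 as its new estimate.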

Root-Of-Average - dynamic 

Now we repeat the second experiment, evolving a protocol that computes the square root of 
the average, with dynamic input data. Here we follow the same approach as for the dynamic 
average protocol: Tests are run with the same two volatility settings as in Section 24.1.2. 

Figure 24.14: The evolutionary progress of the dynamic average protocol. 




Figure 24.15: The relation of f_1 and f_2 in the dynamic average protocol. 

Figure 24.16 shows how f_1 changes over time. Like in Figure 24.12, we have to use a logarithmic 
scaling for f_1 to illustrate it properly. For the tests with the slower changing data (solid 
lines), an intermediate solution is included because the final results were too complicated 
to be sketched here. The evolutions with the highly dynamic input dataset, however, did not 
yield functional aggregation protocols. From this we can conclude that there is a threshold of 
volatility beyond which Genetic Programming is no longer able to breed stable formulas. 

Figure 24.16: The evolutionary progress and one grown solution of the dynamic root-of-average 
protocol. 



The relation of f_1 and f_2, outlined in Figure 24.17, complies with our expectations. In 
every experiment run, increasing f_1 is usually coupled to a deterioration of f_2, which means 
that the protocol formulas become larger and include more sub-expressions. This is followed 
by a recovery phase in which the formulas are reduced in size. After a phase of rest, where the 
new protocol supposedly spreads throughout the population, the cycle starts all over again 
until the end of the evolution. 



Conclusions 



In this chapter, we have illustrated how Genetic Programming can be utilized for the au- 
tomated synthesis of aggregation protocols. The transition to the evolution of protocols for 
dynamically changing input data is a step towards a new direction. Especially in applica- 
tions like large-scale sensor networks, it is very hard for a software engineer to decide which 
protocol configuration is best. With our evolutionary approach, different solutions could be 
evolved for different volatility settings which can then be selected by the network according 
to the current situation. 

Introduction 



Today, there exist many different optimization frameworks. Some of them are dedicated 
to special purposes like spacecraft design [755] or trading systems [2302]. Others provide 
multi-purpose functionality like GALib [2140], Evolutionary Objects (EO) [1114, 1336] or 
the Java evolutionary computation library (ECJ) [1327]. 

In this part of the book, we want to introduce a new approach in global optimization 
software, called Sigoa, the Simple Interface for Global Optimization Algorithms 1 . Based 
on this library, we want to demonstrate how the optimization algorithms discussed in the 
previous chapters can be implemented. 

We decided to use Java [837, 838, 317] as programming language and runtime system for 
this project since it is very common, object-oriented, and platform-independent. You can 
find more information on Java technology either directly at the website http://java.sun.com/ 
[accessed 2007-07-03] or in books like 

1. Javabuch by Krüger [1217], the best German Java learning resource in my opinion, is 
available online for download at http://www.javabuch.de/ [accessed 2007-07-03]. 

2. For the English reader, Thinking in Java by Eckel [618] would be more appropriate - its 
free third edition is available online at http://www.mindview.net/Books/TIJ/ [accessed 2007-07-03]. 

3. Equally interesting are the O'Reilly books Java in a Nutshell [687] and Java Examples 
in a Nutshell [686] by Flanagan, and Learning Java by Niemeyer and Knudsen [1534]. 

4. Java ist auch eine Insel by [2071] - another good resource written in German, is also 
available online at http://www.galileocomputing.de/openbook/javainsel6/ [accessed 2007-07-03]. 

The binaries and the source files of the software described in this 
book part can be found online at http://www.sigoa.org/. It is not only open-source, 
but licensed very liberally under the LGPL (see appendix Chapter B on page 581) which 
allows for the integration of Sigoa into all kinds of systems, from educational to commercial, 
without any restrictions. Sigoa has features that aim to support optimizing complicated 
types of individuals which require time-consuming simulation and evaluation procedures. 

Genetic Programming of real distributed algorithms is one example for such problems. 
In order to determine such an algorithm's fitness, you need to simulate the algorithm 2. The 
evolution progresses step by step, so at first, we will not have any algorithm that works 
properly for a specified problem. Some of the solution candidates, however, will be able to 
perform some of the sub-tasks of the problem, or will maybe solve it partly. Since they may 
work on some of the inputs while failing to process other inputs correctly, a single simulation 
run will not be sufficient. We rather execute the algorithms multiple times and then use the 
minimum, median, average, or maximum objective values encountered. In the case of growing 
distributed algorithms, it is again not sufficient to simulate one processor. Instead, we need to 
simulate a network of many processors in order to determine the objective values 3. Hence, 
it is simple to imagine that such a process may take some time. There are many other 
examples of optimization problems that involve complicated and time-consuming processes 
or simulations. 

1 http://www.sigoa.org/ 
2 See Section 4.10 on page 219 for further discussions. 
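
The idea of running a randomized simulation several times and then using the minimum, median, average, or maximum of the encountered objective values can be sketched as follows. This is only an illustration under stated assumptions: the class RunAggregator and its methods are hypothetical helpers, not part of Sigoa, and the simulation is abstracted as a plain DoubleSupplier.

```java
import java.util.Arrays;
import java.util.function.DoubleSupplier;

/** Hypothetical helper: aggregates the objective values of repeated simulation runs. */
final class RunAggregator {
  /** Executes the simulation `runs` times and returns the results in ascending order. */
  static double[] sample(DoubleSupplier simulation, int runs) {
    double[] values = new double[runs];
    for (int i = 0; i < runs; i++) {
      values[i] = simulation.getAsDouble();
    }
    Arrays.sort(values);
    return values;
  }

  static double min(double[] sorted) {
    return sorted[0];
  }

  static double max(double[] sorted) {
    return sorted[sorted.length - 1];
  }

  /** Median of a sorted, non-empty array (mean of the two middle values if even). */
  static double median(double[] sorted) {
    int n = sorted.length;
    return (n % 2 == 1) ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
  }

  static double average(double[] sorted) {
    double sum = 0.0;
    for (double v : sorted) {
      sum += v;
    }
    return sum / sorted.length;
  }
}
```

Which aggregate is appropriate depends on the problem: the maximum of an error measure is a pessimistic choice that rewards robustness, whereas the median is less sensitive to single outlier runs.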

A framework capable of partially dealing with such aspects in an elegant manner has 
already been developed by the author in the past, see Weise [2175], Weise and Geihs 
[2176, 2177]. With the Sigoa approach, we use our former experiences to create a software 
package that has a higher performance and is far more versatile: One of the key features of 
Sigoa is the separation of specification from implementation, which allows heavyweight 
implementations as required for the evolution of distributed algorithms as well as 
lightweight optimization algorithms which do not need simulations at all, like numerical 
minimization. This clear division not only allows for implementing all the 
optimization algorithms introduced in the previous parts but is also a good basis for including 
new, experimental methods that may not have been discussed yet. 

Before starting with the specification of the Sigoa approach, we performed a detailed 
study on different global optimization methods and evolutionary algorithms. Additionally, 
we used the lessons learned from designing the DGPF system to write down the following 
major requirements: 

25.1 Requirements Analysis 

25.1.1 Multi-Objectivity 

Almost all real-world problems involve contradicting objectives. A distributed algorithm 
evolved should, for example, be efficient and yet simple. It should consume little memory 
and involve as little communication between different processors as possible but on the other 
hand should ensure proper functionality and be robust against lost or erroneous messages. 
The first requirement of an optimization framework is thus multi-objectivity. 

25.1.2 Separation of Specification and Implementation 

It should easily be possible to adapt the optimization framework to other problems or prob- 
lem domains. The ability to replace the solution candidate representation forms is therefore 
necessary. Furthermore, the API must allow the implementation of all optimization algo- 
rithms discussed in the previous chapters in an easy and elegant manner. It should further be 
modular, since most of the optimization algorithms also consist of different sub-algorithms, 
as we have seen for example in Chapter 2 on page 95. 

From this requirement we deduce that the software architecture used for the whole framework 
should be component-based. Each component should communicate with the others only 
through clearly specified interfaces. This way, each module will be exchangeable and may 
even be represented by proxies, granting a maximum of extensibility. If 
we define a general interface for selection, we could modify the SPEA-algorithm (see ?? on 
page ??), which originally uses tournament selection, to use another selection algorithm. 

Hence, we will define Java-interfaces for all parts of optimization algorithms such as 
fitness assignment or clustering methods used for pruning the optimal sets. By doing so, we 
reach a separation of the specification from the implementation. For all interfaces we will 
provide a reference implementation which can easily be exchanged, allowing for different 
levels of complexity in the realizations. 
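
To illustrate what such a separation buys us, consider the selection example mentioned above. The following is a minimal sketch, assuming a hypothetical interface Selection and two hypothetical implementations; these names and signatures are chosen for illustration only and are not the actual Sigoa specification.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical selection contract; `order` sorts better individuals first. */
interface Selection<T> {
  List<T> select(List<T> population, Comparator<T> order, int count, Random random);
}

/** Tournament selection: the best of k randomly drawn individuals wins. */
final class TournamentSelection<T> implements Selection<T> {
  private final int k;

  TournamentSelection(int k) {
    this.k = k;
  }

  @Override
  public List<T> select(List<T> population, Comparator<T> order, int count, Random random) {
    List<T> mates = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      T best = population.get(random.nextInt(population.size()));
      for (int j = 1; j < this.k; j++) {
        T rival = population.get(random.nextInt(population.size()));
        if (order.compare(rival, best) < 0) {
          best = rival;
        }
      }
      mates.add(best);
    }
    return mates;
  }
}

/** Truncation selection: simply keep the `count` best individuals. */
final class TruncationSelection<T> implements Selection<T> {
  @Override
  public List<T> select(List<T> population, Comparator<T> order, int count, Random random) {
    List<T> sorted = new ArrayList<>(population);
    sorted.sort(order);
    return new ArrayList<>(sorted.subList(0, Math.min(count, sorted.size())));
  }
}
```

An optimizer that only holds a reference of type Selection does not need to change when one implementation is exchanged for the other.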



3 In Section 24.1.2 on page 414 you can find a good example for this issue. 

25.1.3 Separation of Concerns 

An optimization system consists not only of the optimization algorithms themselves. It needs 
interfaces to simulators. If it is distributed, there must be a communication subsystem. Even 
if the optimization system is not distributed, we will most likely make use of parallelism since 
the processors in today's off-the-shelf PCs already offer hyper-threading 
or dual-core technology [570, 1891]. If Sigoa is utilized by multiple other software systems 
which transfer optimization tasks to it, security issues arise. These aspects are orthogonal to 
the mathematical task of optimizing and should therefore be specified at different places and 
be clearly separated from the pure algorithms. Best practice dictates considering 
such aspects in the earliest software design phase of every project and thus, also in the Sigoa 
library. 

25.1.4 Support for Pluggable Simulations and Introspection 

In most real-world scenarios, simulations are needed to evaluate the objective values of the 
solution candidates. If we use the framework for multiple problem domains, we will need 
to exchange these simulations or even want to rely on external modules. In some cases, 
the value of an objective function is an aggregate of everything that happened during 
the simulation. Therefore, the objective functions need a clear insight into what is going on. Since we separate 
the objective functions from the simulations by clearly specified interfaces (as discussed in 
Section 25.1.3), these interfaces need to provide this required functionality of introspection. 

In the use case of evolving a distributed algorithm, we can visualize the combination 
of the separation of concerns and introspective simulations: Besides working correctly, a 
distributed algorithm should use as few messages as possible or at least have stable demands 
concerning the bandwidth on the communication channel. We therefore could write an 
objective function which inspects the number of messages underway in the simulation and 
computes a long-term average and variance. The simulation itself then does not need to be 
aware of that; it simply has to offer the functionality of counting the messages currently 
in transmission. The benefit is that we can now replace the objective function by another 
one that maybe puts the pressure a little bit differently, leading to better results, without 
modifying the simulation. On the other hand, we can also use different simulation models, 
for example one where transmission errors can occur and one where this is not the case, 
without touching the objective function. 
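
The message-counting example can be sketched in a few lines. This is only an illustration: the interface MessageSimulation and the classes BandwidthObjective and ScriptedSimulation are hypothetical names, not the interfaces that Sigoa specifies. The objective keeps a running mean and sample variance using Welford's online method, which is an implementation choice made here and not prescribed by the text.

```java
/** Hypothetical introspection interface: all that the objective function may see. */
interface MessageSimulation {
  /** Advances the simulated network by one time step. */
  void step();

  /** Returns the number of messages currently in transmission. */
  int messagesInTransit();
}

/** Objective: long-term average and variance of the in-flight message count. */
final class BandwidthObjective {
  private long n;
  private double mean;
  private double m2;

  /** Samples the simulation once; uses Welford's online update. */
  void inspect(MessageSimulation simulation) {
    double x = simulation.messagesInTransit();
    this.n++;
    double delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  double average() {
    return this.mean;
  }

  /** Sample variance; 0 if fewer than two samples were taken. */
  double variance() {
    return (this.n > 1) ? this.m2 / (this.n - 1) : 0.0;
  }
}

/** Toy simulation that replays a fixed schedule of message counts. */
final class ScriptedSimulation implements MessageSimulation {
  private final int[] schedule;
  private int t = -1;

  ScriptedSimulation(int... schedule) {
    this.schedule = schedule;
  }

  @Override
  public void step() {
    if (this.t < this.schedule.length - 1) {
      this.t++;
    }
  }

  @Override
  public int messagesInTransit() {
    return (this.t < 0) ? 0 : this.schedule[this.t];
  }
}
```

Any other simulation model, for instance one with transmission errors, only has to implement the same interface; BandwidthObjective stays untouched.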

25.1.5 Distribution Utilities 

As already said, there are many applications where the simulations are very complicated and 
therefore, our architecture should allow us to distribute the arising workload to a network of 
many computers. The optimization process then can run significantly faster because many 
optimization techniques (especially evolutionary algorithms) are very suitable for parallel 
and distributed execution as discussed in Chapter 18 on page 299. 

25.2 Architecture 

We want to design the Sigoa optimization system based on these requirements. In this 
book part, we have assigned different chapters to the classes of different components of 
Sigoa and their sub-algorithms. By specifying interfaces for all aspects of optimization and 
implementing them elsewhere, the versatility to exchange all components is granted, so 
customized optimizers can be built to obtain the best results for different problem domains. 
Furthermore, interfaces allow us to implement components in different levels of detail: there 
may be applications where the evaluation of objective functions involves massive simulations 
(like genetic programming) and applications where the simple evaluation of mathematical 
functions is enough (like numerical minimization). In the latter case, using a system that provides 
extended support for simulations may result in performance degradation since a lot of 
useless work is performed. If the mechanism that computes the objective values can be 
exchanged, an optimal approach can be used in each case. 

As a result of these considerations, we divide the Sigoa architecture in org.sigoa into two 
main packages: org.sigoa.spec contains the specifications and org.sigoa.refimpl a reference 
implementation. Figure 25.1 illustrates this top-level package hierarchy. 



[Package diagram: the libraries org and sfc, the package sigoa, and its sub-packages refimpl and spec, connected by «Import» relations.] 

Figure 25.1: The top-level packages of the Sigoa optimization system. 



The specification of the functionality of Sigoa is given by interfaces and a few basic 
utility classes only. It is independent of any library or other software system and does 
not require prerequisites. The interfaces can therefore also be implemented as wrappers that 
bridge to other, existing optimization systems. Most of these specification interfaces inherit 
from java.io.Serializable and hence can be serialized using the Java Object Serialization 
mechanisms 4. This way, we provide the foundation for creating snapshots of a running 
optimization algorithm, which would allow for starting, stopping, restarting, and migrating 
it. 
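
How a snapshot could be taken and restored with the standard Java Object Serialization API is sketched below. The class OptimizerState is a hypothetical stand-in for the state of a running algorithm and Snapshots is a hypothetical helper; neither is part of Sigoa. Only classes from java.io are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the serializable state of a running optimizer. */
final class OptimizerState implements Serializable {
  private static final long serialVersionUID = 1L;
  final int generation;
  final double[] best;

  OptimizerState(int generation, double[] best) {
    this.generation = generation;
    this.best = best.clone();
  }
}

/** Hypothetical helper that turns serializable objects into snapshots and back. */
final class Snapshots {
  /** Serializes the state into a byte array, e.g., for storing or migrating it. */
  static byte[] save(Serializable state) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(state);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Restores an object from a snapshot created by save. */
  static Object restore(byte[] snapshot) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The byte array could equally be written to a file or sent over a network connection, which is what makes stopping, restarting, and migrating an optimization process conceivable.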

The reference implementation uses an additional software package called Sfc, the Java 
Software Foundation Classes - an LGPL-licensed open-source library available under the same 
URL as Sigoa that provides some useful classes for tasks that are needed in many applications 
like enhanced IO, XML support, extended and more efficient implementations of the Java 
Collection Framework 5 interfaces and so on. This utility collection is not directly linked to 
optimization algorithms but provides valuable services that ease the implementation of the 
Sigoa components. 

The package hierarchy of the reference implementation is identical to the one of the 
specifications. The package org.sigoa.spec.gp.reproduction, for example, contains the 
definition of mutation and crossover operations whereas the package 
org.sigoa.refimpl.gp.reproduction contains the reference implementation of these operations. 

25.3 Subsystems 

As illustrated in Figure 25.2, the Sigoa framework is constituted by nine subsystems: 

4 http://java.sun.com/javase/6/docs/technotes/guides/serialization/ [accessed 2007-07-03] 

5 http://java.sun.com/javase/6/docs/technotes/guides/collections/ [accessed 2007-07-03] 

Figure 25.2: The subsystem specification of the optimization framework. 



1. The adaptation package contains mechanisms that allow components to adapt themselves 
to a given situation based on rules. This can be used for example by optimization 
algorithms in order to adjust their parameters. A very simple application is the 
termination criterion 6: a rule could be defined that terminates an algorithm after a given 
time. 

2. In the clustering-package, interfaces for clustering-algorithms (as defined in Chapter 29 
on page 535) are specified. Clustering algorithms can be used to reduce a large set of 
solution candidates to a smaller one without loss of generality. 

3. One way for optimization algorithms to report their status and statistics to the outside 
world would be via events. As already said, we do not treat the optimization process as a 
mere mathematical procedure - it will always be part of some application. As such, not 
only the final results are interesting but also status messages and statistical evaluations 
of its progress. The events package defines interfaces for events that can be generated 
and may contain such information. 

4. The largest subsystem is the go package, where all components and sub-algorithms for 
global optimization are specified. Here you can find the interface specifications that cover 
all the algorithmic and mathematical functionality of global optimization algorithms. 

5. In the jobsystem package, we place the specification of the means to run optimization 
algorithms. An optimizer may be parallelized or run sequentially and therefore may use 
multiple threads. The algorithm itself should be separated from the parallelism issues. 
Applying the definitions of the jobsystem package, optimizers may divide their work into 
parallelizable pieces which can be executed as jobs. Such jobs are then handled by the job 
system, which decides if they should be run in different threads or performed sequentially. 
This way, it is also possible to manage multiple optimization algorithms in parallel and to 
specify which one will be assigned to how many processors. The implementations of the 
job system specifications could also perform accounting and performance monitoring of 
the work load. 

6 see Section 1.3.4 on page 54 

6. The concept of pipes defined in the pipe package is a very powerful approach for realizing 
optimization. It not only allows separating the different components of an optimizer 
completely; totally new components, like statistics monitors, can also easily be inserted 
into a system with minimum changes. 

7. The job system enables Sigoa to handle multiple optimization requests at once. Since it is 
a plain component interface, these requests may come from anywhere, maybe even from 
a web service interface built on top of it. It must somehow be ensured that such requests 
do not interfere or even perform harmful or otherwise malicious actions. Therefore, a 
security concept is mandatory. In the security package we specify simple interfaces that 
build on the Java Security Technology 7 . 

8. The behavior of solution candidates is often simulated in order to determine their 
objective values. The simulation package provides interfaces that specify how simulations 
can be accessed, made available, and managed. 

9. Stochastic evaluations are a useful tool for optimization algorithms. As already said, the 
application using the Sigoa system may regularly need information about the progress, 
which normally can only be given in form of some sort of statistical evaluation. This 
data may also be used by the optimization algorithms themselves or by adaptation 
rules. Furthermore, most of the global optimization methods discussed here are randomized 
algorithms. They thus need random number generators as introduced in Section 28.9 on 
page 526. 
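
The central idea of the job system from item 5, namely that an optimizer only hands over jobs and does not care whether they are executed sequentially or in parallel, can be sketched with the standard java.util.concurrent classes. The interface JobRunner and its two implementations are hypothetical illustrations and do not reproduce the interfaces of the jobsystem package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical contract: execute all jobs, return their results in job order. */
interface JobRunner {
  <T> List<T> runAll(List<Callable<T>> jobs);
}

/** Runs the jobs one after the other in the calling thread. */
final class SequentialRunner implements JobRunner {
  @Override
  public <T> List<T> runAll(List<Callable<T>> jobs) {
    List<T> results = new ArrayList<>(jobs.size());
    try {
      for (Callable<T> job : jobs) {
        results.add(job.call());
      }
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return results;
  }
}

/** Runs the jobs on a fixed number of worker threads. */
final class ThreadedRunner implements JobRunner {
  private final int threads;

  ThreadedRunner(int threads) {
    this.threads = threads;
  }

  @Override
  public <T> List<T> runAll(List<Callable<T>> jobs) {
    ExecutorService pool = Executors.newFixedThreadPool(this.threads);
    try {
      List<T> results = new ArrayList<>(jobs.size());
      // invokeAll blocks until all jobs are done and keeps the job order.
      for (Future<T> future : pool.invokeAll(jobs)) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

An optimizer written against JobRunner, for example one that evaluates each individual of a population as a separate job, behaves identically with both runners; only the wall-clock time differs.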



7 http://java.sun.com/javase/6/docs/technotes/guides/security [accessed 2007-07-03] 

Examples 



Before we go into detail about the different packages and utilities of the Sigoa 
software, we will present some application examples. These give a straightforward insight 
into the usage and customization of the Sigoa components, which most probably is good 
enough to apply them to other problems. A more specific discussion of the Sigoa packages 
following this chapter then rounds off the view on this novel optimization system. 

26.1 The 2007 DATA-MINING-CUP 

As an application example for genetic algorithms, the 2007 Data-Mining-Cup Contest has 
been introduced in Section 22.1.2 on page 374. We strongly recommend reading this section 
first. There, we have discussed the basic principles behind the challenge and the structure 
of one possible solution to it. Here we will only show how these ideas can be realized easily 
using the Sigoa framework. 

The objective of the contest is to classify a set of 50 000 data vectors, each containing 20 
features (of which only 17 are relevant), into one of the three groups A, B, and 
N. In order to build classifiers that do so, another 50 000 datasets with already known 
classifications are available as training data. Thus, let us start by representing the three 
possible classification results using a simple Java enum like in Listing 26.1. 

Our approach in Section 22.1.2 was to solve the task using a modified version of Learning 
Classifier Systems C. For the contest, a function P(C) denoting the profit that could be 
gained with a classifier C was already defined (see Equation 22.1). Thus, we simply strip the 
LCSs of their learning capability and maximize the profit directly. 

26.1.1 The Phenotype 

The problem space X was thus composed of mere classifier systems, the phenotypes of a 
genetic algorithm. They consist of a set of rules m_rules (the single classifiers) which we can 
represent as byte arrays containing the conditions and rule outputs m_results (instances of 
EClasses). 

public enum EClasses { 
  /** class A */ 
  A, 
  /** class B */ 
  B, 
  /** class N */ 
  N; 
} 

Listing 26.1: The enum EClasses with the possible DMC 2007 classifications. 

Listing 26.2 illustrates how the method classify works, which classifies a set of 17 relevant 
features (stored in the array data) into one of the three possible EClasses instances. It 
iterates over all the rules in m_rules and checks if rule m_rules[i] fits perfectly according to 
the definitions in Table 22.2 on page 378. If so, the corresponding classification m_results[i] 
is returned. classify also keeps track of the rule which has the fewest violated conditions 
in the variables lec and leci. If no perfectly matching rule could be found, the threshold 
mentioned in Section 22.1.2 is checked: if lec <= 3, the classification m_results[leci] belonging 
to the rule with the least violations is returned. Otherwise, the class N represented 
by EClasses.N is assigned to the data sample. 

So this is basically what a phenotype can look like in Sigoa - you can clearly see that, 
apart from implementing java.io.Serializable, no further requirements are imposed on its 
structure. The method classify is not mandatory; it will just be part of the evaluation in 
this particular optimization problem. Other problems may need totally different functionality. 

26.1.2 The Genotype and the Embryogeny 

The genotype that belongs to the phenotypic individual representations is a variable-length 
bit string. Such genomes have been discussed extensively in Section 3.5 on page 149. In 
Figure 22.4, we have introduced the genotype-phenotype mapping in this particular application: 
since each of the 17 conditions of a rule takes one of four possible values (2 bits each) and 
the rule's classification is one of three possible classes (A, B, and N; again 2 bits), we need 
17 * 2 + 2 = 36 bits to encode a single classifier, which is the granularity, i.e., the gene size 
of our genome. A classifier system in turn may consist of an arbitrary number of such classifiers. 
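
The arithmetic behind the gene size can be checked with a small stand-alone sketch. It does not use Sigoa's bit string streams; the class GeneLayout and its methods are hypothetical, and the bit order (most significant bit first within each byte) is an assumption made only for this illustration.

```java
/** Hypothetical helper illustrating the 36-bit gene layout of one classifier. */
final class GeneLayout {
  static final int CONDITIONS = 17;
  static final int BITS_PER_FIELD = 2;
  /** 17 conditions with 2 bits each plus 2 bits for the classification = 36. */
  static final int GENE_BITS = CONDITIONS * BITS_PER_FIELD + BITS_PER_FIELD;

  /** Number of complete classifiers (genes) encoded in a genotype. */
  static int ruleCount(byte[] genotype) {
    return (genotype.length * 8) / GENE_BITS;
  }

  /** Reads `count` bits starting at `bitOffset`, most significant bit first (assumption). */
  static int readBits(byte[] data, int bitOffset, int count) {
    int value = 0;
    for (int i = 0; i < count; i++) {
      int bit = bitOffset + i;
      int b = (data[bit >>> 3] >>> (7 - (bit & 7))) & 1;
      value = (value << 1) | b;
    }
    return value;
  }
}
```

A genotype of 9 bytes holds 72 bits and thus exactly two classifiers, whereas a genotype of 5 bytes (40 bits) holds only one; the remaining 4 bits are ignored.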

In Sigoa, we can represent variable-length as well as fixed-length bit strings as byte arrays 
(byte[]) for which predefined creation, mutation, and crossover operators exist. Therefore, 
we do not have to deal with the reproduction operations directly and can concentrate on the 
translation of a genotype g ∈ byte[] into a corresponding phenotype which is an instance of 
ClassifierSystem. Such a translation is called genotype-phenotype mapping (see Section 3.8 
on page 155) or artificial embryogeny (discussed in Section 3.8), for which Sigoa offers a core 
interface IEmbryogeny (see ?? on page ??) and a reference implementation Embryogeny (see ?? 
on page ??) along with an extension for bit strings, BitStringEmbryogeny (see ?? on page ??), 
which provides special streams for input and output of structured data from and to bit 
strings. We simply need to extend this class by providing (at least) the transformation 
function gpm from genotypes to phenotypes and (optionally) vice versa. Listing 26.3 shows this 
extension in form of the class ClassifierEmbryogeny. The constant CLASSIFIER_EMBRYOGENY 
provides a globally shared instance of this new embryogeny. 

26.1.3 The Simulation 

So now we need to find out how an evolved classifier system C behaves. For this purpose, we can 
use the provided test datasets or, better, only a good share of them while saving the rest 
in order to check if our classifier system generalizes well. For these training sets, we build a 
matrix M(C) where the columns denote the classification results delivered by C and the 
rows contain the true classes. For determining the zero-based indices into this matrix, we 
use the method ordinal() of the EClasses enum, i.e., m_{2,1} would represent those elements 
in class N that were misclassified into group B - 2799 in the example matrix M_ex of 
Equation 26.1. From M_ex, we can furthermore read that 1087 of the samples belonging to 
class B were correctly classified whereas 777 were assigned to class A and 1462 to class N. 




(26.1) 

public class ClassifierSystem extends JavaTextable implements Serializable { 
  ... 
  private final byte[][] m_rules; 
  private final EClasses[] m_results; 
  ... 
  public ClassifierSystem(final byte[][] rules, final EClasses[] results) { 
    super(); 
    this.m_results = results; 
    this.m_rules = rules; 
  } 
  ... 
  public EClasses classify(final byte[] data) { 
    byte[][] rules; 
    byte[] rule; 
    int i, j, ec, lec, leci; 

    rules = this.m_rules; 
    lec = Integer.MAX_VALUE; 
    leci = 0; 

    main: for (i = (rules.length - 1); i >= 0; i--) { 
      rule = rules[i]; 
      ec = 0; 
      for (j = (rule.length - 1); j >= 0; j--) { 
        switch (rule[j]) { 
          case 0: { 
            if (data[j] != 0) 
              if ((++ec) > 3) continue main; 
            break; 
          } 
          case 1: { 
            if (data[j] < 1) // != 1 
              if ((++ec) > 3) continue main; 
            break; 
          } 
          case 2: { 
            if (data[j] <= 1) // <= 0) 
              if ((++ec) > 3) continue main; 
            break; 
          } 
        } 
      } 
      if (ec <= 0) return this.m_results[i]; 
      if (ec < lec) { 
        lec = ec; 
        leci = i; 
      } 
    } 
    if (lec <= 3) return this.m_results[leci]; 
    return EClasses.N; 
  } 
  ... 
} 

Listing 26.2: The structure of our DMC 2007 classifier system. 

public class ClassifierEmbryogeny extends BitStringEmbryogeny<ClassifierSystem> { 
  /** the classes */ 
  private static final EClasses[] CLASSES = EClasses.values(); 

  /** The globally shared instance */ 
  public static final IEmbryogeny<byte[], ClassifierSystem> 
      CLASSIFIER_EMBRYOGENY = new ClassifierEmbryogeny(); 

  /** This method is supposed to compute an instance of 
   * the phenotype from an instance of the genotype. 
   * @param genotype The genotype instance to breed a 
   * phenotype from. 
   * @return The phenotype hatched from the genotype. */ 
  @Override 
  public ClassifierSystem hatch(final byte[] genotype) { 
    int i, j, c; 
    byte[][] rules; 
    byte[] rule; 
    EClasses[] results; 
    BitStringInputStream bis; 

    if (genotype == null) 
      throw new NullPointerException(); 
    c = ((genotype.length * 8) / 36); 

    rules = new byte[c][17]; 
    results = new EClasses[c]; 

    bis = this.acquireBitStringInputStream(); 
    bis.init(genotype); 

    for (i = 0; i < c; i++) { 
      rule = rules[i]; 
      for (j = 0; j < 17; j++) { 
        rule[j] = (byte) (bis.readBits(2)); 
      } 
      results[i] = (CLASSES[bis.readBits(2) % 3]); 
    } 

    this.releaseBitStringInputStream(bis); 
    return new ClassifierSystem(rules, results); 
  } 
} 

Listing 26.3: The embryogeny component of our DMC 2007 contribution. 



From such matrices, we can easily compute the profit function P(C) as well as other 
features, like how many As, Bs, and Ns were classified incorrectly. What we basically do 
here is to simulate the behavior of the classifiers. For simulations, Sigoa provides the 
interface ISimulation (see ?? on page ??) and its standard implementation Simulation (see ?? 
on page ??). This default implementation just needs to be extended so that it uses the 
training samples, which we load somewhere else (in a class called Datasets), and computes M. 
For this purpose, overriding the method beginIndividual is sufficient; other changes are not 
needed. 

Listing 26.4 shows the most important code of the new class ClassificationSimulation. 
In order to allow us to publish the new simulation in the simulation manager of the 
optimization job system, we also provide a globally shared factory in form of an 
ISimulationProvider instance with the constant PROVIDER in line 3. 

public class ClassificationSimulation extends Simulation<ClassifierSystem> { 
  /** the shared provider for this simulation */ 
  public static final ISimulationProvider PROVIDER = new SimulationProvider(ClassificationSimulation.class); 
  /** the matrix M(C) */ 
  private final int[][] m_classifications; 

  public ClassificationSimulation() { 
    super(); 
    this.m_classifications = new int[3][3]; 
  } 

  /** Here the matrix M(C) is computed 
   * @param what The classifier C to be simulated. */ 
  @Override 
  public void beginIndividual(final ClassifierSystem what) { 
    int i; 
    int[][] x = this.m_classifications; 

    super.beginIndividual(what); 
    for (i = (x.length - 1); i >= 0; i--) 
      Arrays.fill(x[i], 0); 
    for (i = (DATA.length - 1); i >= 0; i--) 
      x[CLASSES[i].ordinal()][what.classify(DATA[i]).ordinal()]++; 
  } 

  /** Compute the profit value P(C). */ 
  public int getProfit() { 
    int[][] data = this.m_classifications; 
    return (3 * data[0][0]) + (6 * data[1][1]) - (data[2][0] + 
        data[2][1] + data[0][1] + data[1][0]); 
  } 
} 

Listing 26.4: The simulation for testing the DMC 2007 classifier system. 
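
The bookkeeping of Listing 26.4 can also be tried in isolation, without the Sigoa base classes. The following sketch is only an illustration: the enum Label and the class ConfusionMatrix are hypothetical stand-ins for EClasses and the matrix M(C), and the profit formula is copied from getProfit in Listing 26.4.

```java
/** Hypothetical stand-in for EClasses; the ordinals index the matrix. */
enum Label { A, B, N }

/** Counts classification results: rows are true classes, columns are predictions. */
final class ConfusionMatrix {
  private final int[][] m = new int[3][3];

  /** Records one sample with its true class and the class assigned by the classifier. */
  void record(Label truth, Label predicted) {
    this.m[truth.ordinal()][predicted.ordinal()]++;
  }

  int get(Label truth, Label predicted) {
    return this.m[truth.ordinal()][predicted.ordinal()];
  }

  /** The profit as computed by getProfit in Listing 26.4. */
  int profit() {
    return (3 * this.m[0][0]) + (6 * this.m[1][1])
        - (this.m[2][0] + this.m[2][1] + this.m[0][1] + this.m[1][0]);
  }
}
```

With two correctly classified A samples, one correctly classified B sample, one N sample misclassified as B, and one A sample misclassified as B, the profit is 3 * 2 + 6 * 1 - 2 = 10. Samples assigned to class N neither add to nor subtract from the profit in this formula.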



26.1.4 The Objective Functions 

On the foundation of the new simulation for classifier systems, we can define the objective 
functions that should guide the evolution. In Section 22.1.2 on page 379 we have introduced 
the two most important objective functions: one that minimizes f_1(C) = -P(C) and hence 
maximizes the profit, and f_2(C) = |C| which minimizes the number of rules in the classifier 
system. 

All objective functions in Sigoa are instances of the interface IObjectiveFunction (see ?? 
on page ??). They can be derived from its default implementation ObjectiveFunction (see ?? 
on page ??) which implements the basic functionality so only the real mathematical compu- 
tations need to be added. 

In Listing 26.5, we implement f\. Therefore, the method endEvaluation needs to be 
overridden. Here we store negated profit into a state record which is used by the optimization 
system to compute the objective value and to store it into the individual records later on. 

The only remaining question is: How will the optimizer system know that our objective
function needs an instance of ClassificationSimulation and that it has to call its method
beginIndividual beforehand? The answer is relatively simple: In line 3 of Listing 26.4, we
have defined an instance of SimulationProvider for the ClassificationSimulation. This
provider will later be introduced to the optimization job system. It uses
ClassificationSimulation.class as identifier by default. With the method
getRequiredSimulationId on line 16, we tell the job system that we need a simulation which
is made available by a provider with exactly this ID. Before passing the simulation to our
objective function, the job system will call its beginIndividual method which, in turn,
builds the matrix M(C) holding the information needed for its getProfit method. Now we can
query the profit from this simulation.

public class ProfitObjectiveFunction extends
    ObjectiveFunction<ClassifierSystem, ObjectiveState, Serializable,
        ClassificationSimulation> {

  /** This method is called after any simulation/
   * evaluation is performed. It stores the negated
   * profit -P(C) into the state-variable - that's all. */
  @Override
  public void endEvaluation(final ClassifierSystem individual, final
      ObjectiveState state, final Serializable staticState, final
      ClassificationSimulation simulation) {
    state.setObjectiveValue(-simulation.getProfit());
  }

  /**
   * Obtain the ID of the required simulator.
   * @return The ID = class of our simulator */
  @Override
  public Serializable getRequiredSimulationId() {
    return ClassificationSimulation.class;
  }
}

Listing 26.5: The profit objective function $f_1(C) = -P(C)$ for the DMC 2007.

public class SizeObjectiveFunction extends
    ObjectiveFunction<ClassifierSystem, ObjectiveState, Serializable,
        ISimulation<ClassifierSystem>> {

  /** This method is called after any simulation/
   * evaluation is performed. It stores the size of
   * the classifier system |C| into the state-
   * variable - that's all. */
  @Override
  public void endEvaluation(final ClassifierSystem individual, final
      ObjectiveState state, final Serializable staticState, final
      ISimulation<ClassifierSystem> simulation) {
    state.setObjectiveValue(Math.max(individual.getSize(), 3));
  }
}

Listing 26.6: The size objective function $f_2(C) = |C|$ for the DMC 2007.

For the secondary objective function $f_2$ defined in Listing 26.6, we do not need any
simulation. Instead, we directly query the number of rules in the classifier system via the
method getSize. In Listing 26.2, we have omitted this routine for space reasons; it simply
returns m_rules.length. Again, this value is stored into the state record passed in from the
evaluator component of the job system, which will then do the rest of the work.

26.1.5 The Evolution Process

Now the work is almost done; we just need to start the optimization process. Listing 26.7
presents a main method, which is called on startup of a Java program, that does so. First,
we have to decide which global optimization algorithm should be used. We pick
ElitistEA 1 , an elitist evolutionary algorithm (by default steady-state) with a population
size of 10 * 1024 and mutation and crossover rates of 0.4 in line 8.

Then we construct an IOptimizationInfo record with all the information that will guide
the evolution 2 . Part of this information is how the solution candidates should be evaluated.
For this, we use an instance of Evaluator 3 (line 15) which is equipped with a List
containing the two new objective functions (lines 10 to 12). We furthermore tell the system
to perform a pure Pareto optimization as discussed in Section 1.2.2 on page 31 by passing
the globally shared instance of ParetoComparator 4 (line 16) into the info record. We then
define that our embryogeny component should be used to translate the bit string genotypes
into ClassifierSystem phenotypes in line 17. These genotypes are produced by the default
reproduction operators for variable-length bit string genomes 5 added in lines 18 to 20. All of
them are created with a granularity of 36, which ensures that the sizes of all genotypes are
multiples of 36 bits and that all crossover operations only occur at such 36 bit
boundaries.

After this is done, we instantiate SimulationManager 6 and publish the new simulation that
we have developed in Section 26.1.3 on page 446 by adding its provider to the simulation
manager in line 27. The job system created in line 28 allows the evaluator to access the
simulation manager, an instance of the interface ISimulationManager 7 . The evaluator will
then ask its objective functions which simulations they need (in our case a
ClassificationSimulation) and then query the simulation manager to provide them.

In line 28, we decided to use a multi-processor job system which is capable of
transparently parallelizing the optimization process. The different types of job systems,
which are instances of the interface IJobSystem specified in ?? on page ??, are discussed in
?? on page ??. We add our optimizer to the system in line 29 and finally start it in line 30.
Since we have added no termination criterion, the system will run until it is stopped
externally in this example.

In order to get information on its progress, we have attached two special output pipes
(see ?? on page ??) in lines 23 and 24 to the optimizer's non-prevalence pipe. In each
generation, all non-prevailed (in our case, non-dominated) individuals are written to this
pipe and thus pass our two pipes, which create new text files with information about them.
The first one, the IndividualPrinterPipe, uses the directory c and creates files whose names
start with a c followed by the current generation index. It writes down
the full information about all individuals. From these files, we can later easily reconstruct
the complete individuals and, for instance, integrate them into real applications. The
second printer pipe, an instance of ObjectivePrinterPipe, stores only the objective values in
a comma-separated-values format. Its output files are put into the directory bo and their
names start with bo followed by the generation index. Such files are especially useful for
getting a quick overview of how the evolution progresses. Later, they may also be read into
spreadsheets or graphical tools in order to produce diagrams like Figure 22.5 on page 380.

1 see ?? on page ?? 

2 IOptimizationInfo is discussed in ?? on page ??, its reference implementation
OptimizationInfo in ?? on page ??.

3 The class Evaluator, discussed in ?? on page ??, is the default implementation of the
interface IEvaluator specified in ?? on page ??.

4 The class ParetoComparator, elaborated on in ?? on page ??, implements the interface
IComparator defined in ?? on page ??.

5 These operations, introduced in ?? on page ??, implement the interfaces ICreator, IMutator,
and ICrossover specified in ?? on page ??.

6 see ?? on page ?? 

7 see ?? on page ??

public static void main(String[] args) {
  EA<byte[], ClassifierSystem> opt;
  IOptimizationInfo<byte[], ClassifierSystem> oi;
  IJobSystem s;
  SimulationManager m;
  List<IObjectiveFunction<ClassifierSystem, ?, ?,
      ISimulation<ClassifierSystem>>> l;

  opt = new ElitistEA<byte[], ClassifierSystem>(10 * 1024, 0.4d,
      0.4d);

  l = new ArrayList<IObjectiveFunction<ClassifierSystem, ?, ?,
      ISimulation<ClassifierSystem>>>();
  l.add(new ProfitObjectiveFunction());
  l.add(new SizeObjectiveFunction());

  oi = new OptimizationInfo<byte[], ClassifierSystem>(
      new Evaluator<ClassifierSystem>(l),
      ParetoComparator.PARETO_COMPARATOR,
      ClassifierEmbryogeny.CLASSIFIER_EMBRYOGENY,
      new VariableLengthBitStringCreator(36),
      new VariableLengthBitStringMutator(36),
      new VariableLengthBitStringNPointCrossover(36),
      null);

  opt.addNonPrevailedPipe(new IndividualPrinterPipe<byte[],
      ClassifierSystem>(new FileTextWriterProvider(new
          File("c"), "c"), false));
  opt.addNonPrevailedPipe(new ObjectivePrinterPipe<byte[],
      ClassifierSystem>(new FileTextWriterProvider(new File("bo"),
          "bo"), false));

  m = new SimulationManager();
  m.addProvider(ClassificationSimulation.PROVIDER);
  s = new MultiProcessorJobSystem(m);
  s.executeOptimization(opt, new JobInfo<byte[],
      ClassifierSystem>(oi));
  s.start();
}

Listing 26.7: A main method that runs the evolution for the 2007 DMC.

Set Theory



Set theory 1 [550, 880, 1967] is an important part of mathematical theory. Numerous
other disciplines like algebra, analysis, and topology are based upon it. Since set theory
(and the topics to follow) is not the topic of this book but a mere prerequisite, this chapter
(and the ones to follow) will just briefly introduce it. We make heavy use of informal
definitions and in some cases cut matters roughly short. More information on the topics
discussed can be retrieved from the literature references.

Set theory can be divided into naive set theory 2 and axiomatic set theory 3 . The first
approach, naive set theory, is inconsistent and therefore not considered in this book.

Definition 27.1 (Set). A set is a collection of objects considered as a whole 4 . The objects
of a set are called elements or members. They can be anything, from numbers and vectors,
to complex data structures, algorithms, or even other sets. Sets are conventionally denoted
with capital letters, A, B, C, etc. while their elements are usually referred to with small
letters a, b, c.

27.1 Set Membership 

The expression $a \in A$ means that the element a is a member of the set A while
$y \notin A$ means that y is not a member of A. There are three common forms to define sets:

1. With their elements in braces: $A = \{1, 2, 3\}$ defines a set A containing the three elements
1, 2, and 3. $\{1, 1, 2, 3\} = \{1, 2, 3\}$ since the curly braces only denote set membership.

2. The same set can be specified using logical operators to describe its elements:
$\forall b \in \mathbb{N} : (b \geq 1) \wedge (b < 4) \Leftrightarrow b \in B$.

3. A shortcut for the previous form is to denote the logical expression in braces, like
$C = \{(c \geq 1) \wedge (c < 4), c \in \mathbb{N}\}$.

The cardinality of a set A is written as $|A|$ and stands for the count of elements in the
set.
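The three forms of definition and the notion of cardinality can be tried out with the standard Java collections. This is a minimal sketch; the class name SetBasicsSketch and the helper fromPredicate are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only. */
class SetBasicsSketch {
  /** Builds {c in N : (c >= lo) and (c < hi)} from its defining predicate. */
  static Set<Integer> fromPredicate(int lo, int hi) {
    Set<Integer> s = new TreeSet<>();
    for (int c = lo; c < hi; c++) {
      s.add(c);
    }
    return s;
  }

  public static void main(String[] args) {
    // the duplicate 1 collapses: {1, 1, 2, 3} = {1, 2, 3}
    Set<Integer> a = new TreeSet<>(Arrays.asList(1, 1, 2, 3));
    Set<Integer> b = fromPredicate(1, 4);
    System.out.println(a.equals(b));   // equality by membership
    System.out.println(a.size());      // cardinality |A|
    System.out.println(a.contains(4)); // membership test
  }
}
```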



27.2 Relations between Sets 

Two sets A and B are said to be equal, written $A = B$, if they have the same members.
They are not equal ($A \neq B$) if either a member of A is not an element of B or an element
of B is not a member of A. If all elements of the set A are also elements of the set B, A
is called a subset of B and B is the superset of A. We write $A \subset B$ if A is a (true)
subset of but not equal to B. $A \subseteq B$ means that A is a subset of B and may be equal
to B. If A is no subset of but may be equal to B, $A \not\subset B$ is written.
$A \not\subseteq B$ means that A is neither a subset of nor equal to B.

$A = B \equiv x \in A \Leftrightarrow x \in B$ (27.1)
$A \neq B \equiv (\exists x : x \in A \wedge x \notin B) \vee (\exists y : y \in B \wedge y \notin A)$ (27.2)
$A \subseteq B \equiv x \in A \Rightarrow x \in B$ (27.3)
$A \subset B \equiv A \subseteq B \wedge \exists y : y \in B \wedge y \notin A$ (27.4)
$A \not\subseteq B \equiv \exists x : x \in A \wedge x \notin B$ (27.5)
$A \not\subset B \equiv (A = B) \vee (\exists x : x \in A \wedge x \notin B)$ (27.6)

1 http://en.wikipedia.org/wiki/Set_theory [accessed 2007-07-03]

2 http://en.wikipedia.org/wiki/Naive_set_theory [accessed 2007-07-03]

3 http://en.wikipedia.org/wiki/Axiomatic_set_theory [accessed 2007-07-03]

4 http://en.wikipedia.org/wiki/Set_%28mathematics%29 [accessed 2007-07-03]
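The subset relations can be expressed with the standard Java collections. The following is a minimal sketch; SetRelationsSketch and its method names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of the subset relations defined above. */
class SetRelationsSketch {
  /** A is a subset of B: every element of a is also in b. */
  static <T> boolean isSubset(Set<T> a, Set<T> b) {
    return b.containsAll(a);
  }

  /** A is a true subset of B: a subset that is not equal to b. */
  static <T> boolean isProperSubset(Set<T> a, Set<T> b) {
    return isSubset(a, b) && !a.equals(b);
  }

  public static void main(String[] args) {
    Set<Integer> a = new HashSet<>(Arrays.asList(1, 2));
    Set<Integer> b = new HashSet<>(Arrays.asList(1, 2, 3));
    System.out.println(isSubset(a, b));       // a is contained in b
    System.out.println(isProperSubset(a, b)); // and differs from b
    System.out.println(isProperSubset(b, b)); // a set is no true subset of itself
  }
}
```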



27.3 Special Sets 

Special sets used in the context of this book are 

1. The empty set $\emptyset = \{\}$ contains no elements ($|\emptyset| = 0$).

2. The natural numbers 5 $\mathbb{N}$ include all whole numbers bigger than 0.
($\mathbb{N} = \{1, 2, 3, ..\}$)

3. The natural numbers including 0 ($\mathbb{N}_0$) include all whole numbers bigger than or
equal to 0. ($\mathbb{N}_0 = \{0, 1, 2, 3, ..\}$)

4. $\mathbb{Z}$ is the set of all integers, positive and negative.
($\mathbb{Z} = \{.., -2, -1, 0, 1, 2, ..\}$)

5. The rational numbers 6 $\mathbb{Q}$ are defined as
$\mathbb{Q} = \{\frac{a}{b} : a, b \in \mathbb{Z}, b \neq 0\}$.

6. The real numbers 7 $\mathbb{R}$ include all rational and irrational numbers (such as
$\sqrt{2}$).

7. $\mathbb{R}^+$ denotes the positive real numbers including 0: ($\mathbb{R}^+ = [0, \infty)$).

8. $\mathbb{C}$ is the set of complex numbers 8 . When needed in the context of this book, we
abbreviate the imaginary unit with i, and the real and imaginary parts of a complex number z
with Re z and Im z.

$\mathbb{N} \subset \mathbb{N}_0 \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}$ (27.7)
$\mathbb{N} \subset \mathbb{N}_0 \subset \mathbb{R}^+ \subset \mathbb{R} \subset \mathbb{C}$ (27.8)

For these numerical sets, special subsets, so-called intervals, can be specified. [1, 5) is a
set which contains all the numbers starting from (including) 1 up to (exclusively) 5. (1, 5],
on the other hand, contains all numbers bigger than 1 and up to inclusively 5. In order to
avoid ambiguities, such sets will always be used in a context where it is clear whether the
numbers in the set are natural or real.



27.4 Operations on Sets 

Let us now define the possible unary and binary operations on sets, some of which are 
illustrated in Figure 27.1. 

5 http://en.wikipedia.org/wiki/Natural_numbers [accessed 2008-01-28] 

6 http://en.wikipedia.org/wiki/Rational_number [accessed 2008-01-28] 

7 http://en.wikipedia.org/wiki/Real_numbers [accessed 2008-01-28]

8 http://en.wikipedia.org/wiki/Complex_number [accessed 2008-01-29]

Figure 27.1: Set operations performed on sets A and B inside a set $\mathbb{A}$.



Definition 27.2 (Set Union). The union 9 C of two sets A and B is written as $A \cup B$ and
contains all the objects that are elements of at least one of the sets.

$C = A \cup B \Leftrightarrow ((c \in A) \vee (c \in B) \Leftrightarrow (c \in C))$
$A \cup \emptyset = A$
$A \cup A = A$
$A \subseteq A \cup B$
(27.9) to (27.13)



Definition 27.3 (Set Intersection). The intersection 10 D of two sets A and B, denoted
by $A \cap B$, contains all the objects that are elements of both of the sets. If
$A \cap B = \emptyset$, meaning that A and B have no elements in common, they are called
disjoint.

$D = A \cap B \Leftrightarrow ((d \in A) \wedge (d \in B) \Leftrightarrow (d \in D))$
$A \cap \emptyset = \emptyset$
$A \cap A = A$
$A \cap B \subseteq A$
(27.14) to (27.18)



Definition 27.4 (Set Difference). The difference E of two sets A and B, $A \setminus B$,
contains the objects that are elements of A but not of B.

$E = A \setminus B \Leftrightarrow ((e \in A) \wedge (e \notin B) \Leftrightarrow (e \in E))$
$\emptyset \setminus A = \emptyset$
$A \setminus A = \emptyset$
(27.19) to (27.23)

9 http://en.wikipedia.org/wiki/Union_%28set_theory%29 [accessed 2007-07-03]

10 http://en.wikipedia.org/wiki/Intersection_%28set_theory%29 [accessed 2007-07-03]

Definition 27.5 (Set Complement). The complementary set $\overline{A}$ of the set A in a
set $\mathbb{A}$ includes all the elements which are in $\mathbb{A}$ but not elements of A:

$A \subseteq \mathbb{A} \Rightarrow \overline{A} = \mathbb{A} \setminus A$ (27.24)
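Union, intersection, difference, and complement map directly onto the bulk operations of java.util.Set. The following minimal sketch returns new sets instead of modifying its arguments; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of the set operations of Definitions 27.2 to 27.5. */
class SetOperationsSketch {
  static <T> Set<T> union(Set<T> a, Set<T> b) {
    Set<T> c = new HashSet<>(a);
    c.addAll(b);
    return c;
  }

  static <T> Set<T> intersection(Set<T> a, Set<T> b) {
    Set<T> d = new HashSet<>(a);
    d.retainAll(b);
    return d;
  }

  static <T> Set<T> difference(Set<T> a, Set<T> b) {
    Set<T> e = new HashSet<>(a);
    e.removeAll(b);
    return e;
  }

  /** Complement of a inside the enclosing set universe. */
  static <T> Set<T> complement(Set<T> a, Set<T> universe) {
    return difference(universe, a);
  }

  public static void main(String[] args) {
    Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3));
    Set<Integer> b = new HashSet<>(Arrays.asList(3, 4));
    Set<Integer> universe = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
    System.out.println(union(a, b));
    System.out.println(intersection(a, b));
    System.out.println(difference(a, b));
    System.out.println(complement(a, universe));
  }
}
```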

Definition 27.6 (Cartesian Product). The Cartesian product 11 P of two sets A and B,
denoted $P = A \times B$, is the set of all ordered pairs (a, b) whose first component is an
element of A and whose second component is an element of B.

$P = A \times B \Leftrightarrow P = \{(a, b) : a \in A, b \in B\}$ (27.25)
$A^n = \underbrace{A \times A \times .. \times A}_{n \text{ times}}$ (27.26)

Definition 27.7 (Countable Set). A set S is called countable 12 if there exists an injective
function 13 $\exists f : S \mapsto \mathbb{N}$.

Definition 27.8 (Uncountable Set). A set is uncountable if it is not countable, i.e., no
such function exists for the set. $\mathbb{N}$, $\mathbb{Z}$, and $\mathbb{Q}$ are countable,
$\mathbb{R}$ and $\mathbb{R}^+$ are not.

Definition 27.9 (Power Set). The power set 14 $\mathcal{P}(A)$ of the set A is the set of all
subsets of A.

$\forall p \in \mathcal{P}(A) \Leftrightarrow p \subseteq A$ (27.27)
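For small finite sets, both the Cartesian product and the power set can be enumerated explicitly. The sketch below is illustrative only: the class and method names are hypothetical, the input list of powerSet is assumed to contain distinct elements, and the bit mask approach limits it to fewer than 31 elements.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch for Definitions 27.6 and 27.9. */
class ProductAndPowerSetSketch {
  /** A x B as a list of ordered pairs {a, b}. */
  static List<int[]> cartesian(int[] a, int[] b) {
    List<int[]> p = new ArrayList<>();
    for (int x : a) {
      for (int y : b) {
        p.add(new int[] { x, y });
      }
    }
    return p;
  }

  /** All subsets of a; bit i of mask decides whether a.get(i) is included. */
  static Set<Set<Integer>> powerSet(List<Integer> a) {
    Set<Set<Integer>> result = new HashSet<>();
    for (int mask = 0; mask < (1 << a.size()); mask++) {
      Set<Integer> subset = new HashSet<>();
      for (int i = 0; i < a.size(); i++) {
        if ((mask & (1 << i)) != 0) {
          subset.add(a.get(i));
        }
      }
      result.add(subset);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(cartesian(new int[] { 1, 2 }, new int[] { 7, 8, 9 }).size());
    System.out.println(powerSet(java.util.Arrays.asList(1, 2, 3)));
  }
}
```

The product has $|A| \cdot |B|$ pairs and the power set has $2^{|A|}$ members, which is why the enumeration is only practical for small sets.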



27.5 Tuples 

Definition 27.10 (Type). A type is a set of values that a variable, constant, function, or 
similar entity may take on. 

We can, for instance, specify the type T = {1, 2, 3}. A variable x which is an instance of 
this type then can take on the values 1, 2, or 3. 

Definition 27.11 (Tuple). A tuple 15 is an ordered, finite sequence of elements, where each 
element is an instance of a certain type. 

To each position i in a tuple t, a type $T_i$ is assigned. The element t[i] at a position i
must then be an element of $T_i$. Unlike sets, tuples may contain the same element twice. Since
every item of a tuple may be of a different type, (Monday, 23, {a, b, c}) is a valid tuple.

In the context of this book, we define tuples with parentheses like (a, b, c) whereas sets
are specified using braces {a, b, c}.

Definition 27.12 (Tuple Type). To formalize this relation, we define the tuple type T.
T specifies the basic sets for the elements of its tuples. If a tuple t meets the constraints
imposed on its values by T, we can write $t \in T$, which means that the tuple t is an instance
of T.

$T = (T_1, T_2, .., T_n), n \in \mathbb{N}$ (27.28)
$t = (t_1, t_2, .., t_n) \in T \Leftrightarrow t_i \in T_i \; \forall 0 < i \leq n \wedge len(t) = len(T)$ (27.29)

11 http://en.wikipedia.org/wiki/Cartesian_product [accessed 2007-07-03] 

12 http://en.wikipedia.org/wiki/Countable_set [accessed 2007-07-03] 

13 see definition of function on page 462 

14 http://en.wikipedia.org/wiki/Axiom_of_power_set [accessed 2007-07-03] 

15 http://en.wikipedia.org/wiki/Tuple [accessed 2007-07-03]

27.6 Lists

Definition 27.13 (List). Lists 16 are abstract data types which can be regarded as special 
tuples. They are sequences where every item is of the same type. 

Unlike our discussion of set theory, the following text about the list data structure
is strictly local to this book and not to be understood as a general mathematical theory.
All the functions and operations defined on lists in this book are only given in order to
allow for a clear and well-defined notation in the other parts of the book, for instance when
specifying optimization algorithms. They are not founded on related work by any other
scientist.

We introduce functions that add elements to or remove elements from lists, and that sort
lists or search within them. Like tuples, lists are written with parentheses in this book.
The single elements of a list are accessed by their index written in brackets
($(a, b, c)[1] = b$), where the first element has the index 0 and the last element has the
index $n - 1$ (n being the count of elements in the list: $n = len((a, b, c)) = 3$). The empty
list is abbreviated with ().

Definition 27.14 (createList). The $l = createList(n, q)$ method creates a new list l of the
length n filled with the item q.

$l = createList(n, q) \Leftrightarrow len(l) = n \wedge \forall 0 \leq i < n \Rightarrow l[i] = q$ (27.30)

Definition 27.15 (insertListItem). The function $m = insertListItem(l, i, q)$ creates a new
list m by inserting one element q in a list l at the index $0 \leq i \leq len(l)$. By doing so,
it shifts all elements located at index i and above to the right by one position.

$m = insertListItem(l, i, q) \Leftrightarrow len(m) = len(l) + 1 \wedge m[i] = q \wedge$
$\quad \forall j : 0 \leq j < i \Rightarrow m[j] = l[j] \wedge$
$\quad \forall j : i \leq j < len(l) \Rightarrow m[j+1] = l[j]$ (27.31)

Definition 27.16 (addListItem). The addListItem function is a shortcut for inserting one
item at the end of a list:

$addListItem(l, q) \equiv insertListItem(l, len(l), q)$ (27.32)

Definition 27.17 (deleteListItem). The function $m = deleteListItem(l, i)$ creates a new
list m by removing the element at index $0 \leq i < len(l)$ from the list l ($len(l) \geq i + 1$).

$m = deleteListItem(l, i) \Leftrightarrow len(m) = len(l) - 1 \wedge$
$\quad \forall j : 0 \leq j < i \Rightarrow m[j] = l[j] \wedge$
$\quad \forall j : i < j < len(l) \Rightarrow m[j-1] = l[j]$ (27.33)

Definition 27.18 (deleteListRange). The method $m = deleteListRange(l, i, c)$ creates a
new list m by removing c elements beginning at index $0 \leq i < len(l)$ from the list l
($len(l) \geq i + c$).

$m = deleteListRange(l, i, c) \Leftrightarrow len(m) = len(l) - c \wedge$
$\quad \forall j : 0 \leq j < i \Rightarrow m[j] = l[j] \wedge$
$\quad \forall j : i + c \leq j < len(l) \Rightarrow m[j-c] = l[j]$ (27.34)
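The list functions defined so far always return a new list and leave their argument untouched. A minimal Java sketch of this behavior, assuming java.util.ArrayList as the backing structure (the class name ListFunctionsSketch is hypothetical), could look as follows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch of Definitions 27.14 to 27.18; inputs are never modified. */
class ListFunctionsSketch {
  static <T> List<T> createList(int n, T q) {
    return new ArrayList<>(Collections.nCopies(n, q));
  }

  static <T> List<T> insertListItem(List<T> l, int i, T q) {
    List<T> m = new ArrayList<>(l);
    m.add(i, q); // shifts the elements at index i and above to the right
    return m;
  }

  static <T> List<T> addListItem(List<T> l, T q) {
    return insertListItem(l, l.size(), q);
  }

  static <T> List<T> deleteListItem(List<T> l, int i) {
    List<T> m = new ArrayList<>(l);
    m.remove(i); // i is an int, so this removes by index
    return m;
  }

  static <T> List<T> deleteListRange(List<T> l, int i, int c) {
    List<T> m = new ArrayList<>(l);
    m.subList(i, i + c).clear(); // removes the c elements starting at index i
    return m;
  }

  public static void main(String[] args) {
    List<String> l = Arrays.asList("a", "c");
    System.out.println(insertListItem(l, 1, "b"));
    System.out.println(deleteListRange(Arrays.asList(1, 2, 3, 4, 5), 1, 3));
  }
}
```

Copying on every call costs linear time per operation, which is acceptable for a notation-level illustration but not for performance-critical code.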

Definition 27.19 (appendList). The function $appendList(l_1, l_2)$ is a shortcut for adding
all the elements of a list $l_2$ to a list $l_1$. We define it recursively as:

$appendList(l_1, l_2) \equiv \begin{cases} l_1 & \text{if } len(l_2) = 0 \\ appendList(addListItem(l_1, l_2[0]), deleteListItem(l_2, 0)) & \text{otherwise} \end{cases}$ (27.35)

16 http://en.wikipedia.org/wiki/List_%28computing%29 [accessed 2007-07-03]

Definition 27.20 (countOccurences). The function $countOccurences(x, l)$ returns the
number of occurrences of the element x in the list l.

$countOccurences(x, l) = |\{i \in 0 \ldots len(l) - 1 : l[i] = x\}|$ (27.36)

Definition 27.21 (subList). The method $subList(l, i, c)$ extracts c elements from the list l
beginning at index i and returns them as a new list.

$subList(l, i, c) \equiv deleteListRange(deleteListRange(l, 0, i), c, |l| - i - c)$ (27.37)

Definition 27.22 (Sorting Lists). It is often useful to have sorted lists 17 . Thus we define
the functions $S = sortList_a(U, cmp)$ and $S = sortList_d(U, cmp)$ which sort a list U in
ascending or descending order using a comparator function $cmp(u_1, u_2)$.

$S = sortList_a(U, cmp)$ (27.38)
$\forall u \in U \; \exists i \in [0, len(U) - 1] : S[i] = u$ (27.39)
$len(S) = len(U)$ (27.40)
$\forall 0 \leq i < len(U) - 1 \Rightarrow cmp(S[i], S[i+1]) \leq 0$ (27.41)

For $S = sortList_d(U, cmp)$, only Equation 27.41 changes, the rest stays valid:

$S = sortList_d(U, cmp)$ (27.42)
$\forall 0 \leq i < len(U) - 1 \Rightarrow cmp(S[i], S[i+1]) \geq 0$ (27.43)

The concept of comparator functions has been introduced in Definition 1.15 on page 38.
$cmp(u_1, u_2)$ returns a negative value if $u_1$ is smaller than $u_2$, a positive number if
$u_1$ is greater than $u_2$, and 0 if both are equal. Comparator functions are very versatile;
they form the foundation of the sorting mechanisms of the Java framework [838, 837], for
instance. In global optimization, they are perfectly suited to represent the Pareto dominance
or prevalence relations introduced in Section 1.2.2 on page 31 and Section 1.2.4. Sorting
according to a specific function f of only one parameter can easily be performed by building
the comparator $cmp(u_1, u_2) \equiv (f(u_1) - f(u_2))$. Thus, we will furthermore synonymously
use the sorting predicate also with unary functions f.

$sortList(U, f) \equiv sortList(U, cmp(u_1, u_2) \equiv (f(u_1) - f(u_2)))$ (27.44)

A list U can be sorted in $O(len(U) \log len(U))$ time complexity. For concrete examples
of sorting algorithms, see [1163, 446, 1850].
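In Java, the comparator functions of Definition 27.22 correspond to java.util.Comparator. The sketch below is illustrative (class and method names are hypothetical): it sorts a copy of the list and builds a comparator from a unary function f. It uses Double.compare rather than the literal difference $f(u_1) - f(u_2)$, an assumption made here because a subtraction can overflow for integer values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative sketch of sortList with comparators and unary functions. */
class SortListSketch {
  /** Ascending sort; returns a new list. */
  static <T> List<T> sortListA(List<T> u, Comparator<T> cmp) {
    List<T> s = new ArrayList<>(u);
    s.sort(cmp);
    return s;
  }

  /** Descending sort: ascending sort under the reversed comparator. */
  static <T> List<T> sortListD(List<T> u, Comparator<T> cmp) {
    return sortListA(u, cmp.reversed());
  }

  /** Comparator with the same sign as f(u1) - f(u2). */
  static <T> Comparator<T> byFunction(ToDoubleFunction<T> f) {
    return (u1, u2) -> Double.compare(f.applyAsDouble(u1), f.applyAsDouble(u2));
  }

  public static void main(String[] args) {
    List<Integer> u = Arrays.asList(3, -1, 2);
    Comparator<Integer> bySquare = SortListSketch.<Integer>byFunction(x -> x * x);
    System.out.println(sortListA(u, bySquare)); // ordered by x^2
    System.out.println(sortListD(u, bySquare));
  }
}
```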

Definition 27.23 (Searching in Unsorted Lists). Searching an element u in an unsorted
list U means walking through it until either the element is found or the whole list has been
scanned, which corresponds to complexity $O(len(U))$.

$searchItem_u(u, U) = \begin{cases} i : U[i] = u & \text{if } u \in U \\ -1 & \text{otherwise} \end{cases}$ (27.45)

Definition 27.24 (Searching in Sorted Lists). Searching an element s in a sorted list S
means to perform a binary search 18 returning the index of the element if it is contained in S.
If $s \notin S$, a negative number is returned indicating the position where the element could
be inserted into the list without violating its order. The function $searchItem_{as}(s, S)$
searches in a list sorted in ascending order, $searchItem_{ds}(s, S)$ searches in a list sorted
in descending order. Searching in a sorted list is done in $O(\log len(S))$ time. For concrete
algorithm examples, again see [1163, 446, 1850].

$searchItem_{as}(s, S) = \begin{cases} i : S[i] = s & \text{if } s \in S \\ (-i - 1) : (\forall j \geq 0, j < i \Rightarrow S[j] \leq s) \wedge (\forall j < len(S), j \geq i \Rightarrow S[j] > s) & \text{otherwise} \end{cases}$ (27.46)

$searchItem_{ds}(s, S) = \begin{cases} i : S[i] = s & \text{if } s \in S \\ (-i - 1) : (\forall j \geq 0, j < i \Rightarrow S[j] \geq s) \wedge (\forall j < len(S), j \geq i \Rightarrow S[j] < s) & \text{otherwise} \end{cases}$ (27.47)

17 http://en.wikipedia.org/wiki/Sorting_algorithm [accessed 2007-07-03]

18 http://en.wikipedia.org/wiki/Binary_search [accessed 2007-07-03]
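The return convention of Equations 27.46 and 27.47 matches that of java.util.Collections.binarySearch, which returns the index of the key if it is found and (-(insertion point) - 1) otherwise. The following minimal sketch wraps it; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch of searching in sorted lists. */
class SearchItemSketch {
  /** Binary search in a list sorted in ascending (natural) order. */
  static <T extends Comparable<? super T>> int searchItemAs(T s, List<T> sorted) {
    return Collections.binarySearch(sorted, s);
  }

  /** Binary search in a list sorted in descending order. */
  static <T extends Comparable<? super T>> int searchItemDs(T s, List<T> sorted) {
    return Collections.binarySearch(sorted, s, Collections.reverseOrder());
  }

  public static void main(String[] args) {
    List<Integer> asc = Arrays.asList(1, 3, 5, 7);
    List<Integer> desc = Arrays.asList(7, 5, 3, 1);
    System.out.println(searchItemAs(5, asc));  // found at index 2
    System.out.println(searchItemAs(4, asc));  // -3: would be inserted at index 2
    System.out.println(searchItemDs(4, desc)); // -3: would be inserted at index 2
  }
}
```

Encoding the insertion point as $-i - 1$ rather than $-i$ keeps the result negative even when the element would be inserted at index 0.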

Definition 27.25 (removeListItem). The function $removeListItem(l, q)$ finds one
occurrence of an element q in a list l by using the appropriate search algorithm and deletes
it (returning a new list m).

$m = removeListItem(l, q) \Leftrightarrow m = \begin{cases} l & \text{if } searchItem(q, l) < 0 \\ deleteListItem(l, searchItem(q, l)) & \text{otherwise} \end{cases}$ (27.48)

We can further define transformations between sets and lists which will implicitly be used
when needed in this book. It should be noted that setToList is not the inverse function of
listToSet.

$B = setToList(\text{set } A) \Rightarrow \forall a \in A \; \exists i : B[i] = a \wedge$
$\quad \forall i \in [0, len(B) - 1] \Rightarrow B[i] \in A \wedge$
$\quad len(setToList(A)) = |A|$ (27.49)

$A = listToSet(\text{list } B) \Rightarrow \forall i \in [0, len(B) - 1] \Rightarrow B[i] \in A \wedge$
$\quad \forall a \in A \; \exists i \in [0..len(B) - 1] : B[i] = a \wedge$
$\quad |listToSet(B)| \leq len(B)$ (27.50)



27.7 Binary Relations 

Definition 27.26 (Binary Relation). A binary 19 relation 20 R is defined as an ordered
triple (A, B, P) where A and B are arbitrary sets, and P is a subset of the Cartesian product
$A \times B$ (see Equation 27.25). The sets A and B are called the domain and codomain of the
relation and P is called its graph. The statement $(a, b) \in P : a \in A \wedge b \in B$ is read
"a is R-related to b" and is written as R(a, b). The order of the elements in each pair of P is
important: If $a \neq b$, then R(a, b) and R(b, a) both can be true or false independently of
each other.

Some types and possible properties of binary relations are listed below and illustrated in
Figure 27.2. A binary relation can be [673]:

1. Left-total if

$\forall a \in A \; \exists b \in B : R(a, b)$ (27.51)

2. Surjective 21 or right-total if

$\forall b \in B \; \exists a \in A : R(a, b)$ (27.52)

19 http://en.wikipedia.org/wiki/Binary_relation [accessed 2007-07-03] 

20 http://en.wikipedia.org/wiki/Relation_%28mathematics%29 [accessed 2007-07-03]

21 http://en.wikipedia.org/wiki/Surjective [accessed 2007-07-03]

[Figure 27.2 shows seven example relations between A and B, each labeled as (not) left-total,
(non-)surjective, (non-)injective, (non-)functional, and (non-)bijective.]

Figure 27.2: Properties of a binary relation R with domain A and codomain B.



3. Injective 22 if

$\forall a_1, a_2 \in A, b \in B : R(a_1, b) \wedge R(a_2, b) \Rightarrow a_1 = a_2$ (27.53)

4. Functional if

$\forall a \in A, b_1, b_2 \in B : R(a, b_1) \wedge R(a, b_2) \Rightarrow b_1 = b_2$ (27.54)

5. Bijective 23 if it is left-total, right-total, injective, and functional.

6. Transitive 24 if

$\forall a \in A, \forall b \in B, \forall c \in A \cap B : R(a, c) \wedge R(c, b) \Rightarrow R(a, b)$ (27.55)

27.7.1 Functions 

Definition 27.27 (Function). A function f is a binary relation with the property that
for an element x of the domain 25 X there is no more than one element y in the codomain
Y such that x is related to y. This uniquely determined element y is denoted by f(x). In
other words, a function is a functional binary relation and we can write:

$\forall x \in X, y_1, y_2 \in Y : f(x, y_1) \wedge f(x, y_2) \Rightarrow y_1 = y_2$ (27.56)

A function maps each element of X to one element in Y. The domain X is the set of
possible input values of f and the codomain Y is the set of its possible outputs. The set of
all actual outputs $\{f(x) : x \in X\}$ is called the range. This distinction between range and
codomain can be made obvious with a small example. The sine function can be defined as a
mapping from the real numbers to the real numbers, $\sin : \mathbb{R} \mapsto \mathbb{R}$, making
$\mathbb{R}$ its codomain. Its actual range, however, is just the real interval [-1, 1].

22 http://en.wikipedia.org/wiki/Injective [accessed 2007-07-03]

23 http://en.wikipedia.org/wiki/Bijective [accessed 2007-07-03]

24 http://en.wikipedia.org/wiki/Transitive_relation

25 http://en.wikipedia.org/wiki/Domain_%28mathematics%29 [accessed 2007-07-03]

Monotonicity

Real functions are monotone, i. e., have the property of monotonicity 26 , if they preserve a 
given order 27 . 

Definition 27.28 (Monotonically Increasing). A function $f : X \mapsto Y$ that maps a
subset of the real numbers $X \subseteq \mathbb{R}$ to a subset of the real numbers
$Y \subseteq \mathbb{R}$ is called monotonic, monotonically increasing, increasing, or
non-decreasing, if and only if Equation 27.57 holds.

$\forall x_1 < x_2, \; x_1, x_2 \in X \Rightarrow f(x_1) \leq f(x_2)$ (27.57)

Definition 27.29 (Monotonically Decreasing). A function $f : X \mapsto Y$ that maps a
subset of the real numbers $X \subseteq \mathbb{R}$ to a subset of the real numbers
$Y \subseteq \mathbb{R}$ is called monotonically decreasing, decreasing, or non-increasing, if
and only if Equation 27.58 holds.

$\forall x_1 < x_2, \; x_1, x_2 \in X \Rightarrow f(x_1) \geq f(x_2)$ (27.58)
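On a finite set of sample points, Equations 27.57 and 27.58 can be tested directly. The sketch below is illustrative and its names are hypothetical; note that such a test can only refute monotonicity on the real numbers, never prove it. Checking consecutive points suffices for a sorted sample because $\leq$ and $\geq$ are transitive.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch: monotonicity checks on sorted sample points. */
class MonotonicitySketch {
  /** True if f(x1) <= f(x2) for all consecutive points; xs must be sorted ascending. */
  static boolean isIncreasingOn(DoubleUnaryOperator f, double[] xs) {
    for (int i = 0; i + 1 < xs.length; i++) {
      if (f.applyAsDouble(xs[i]) > f.applyAsDouble(xs[i + 1])) return false;
    }
    return true;
  }

  /** True if f(x1) >= f(x2) for all consecutive points; xs must be sorted ascending. */
  static boolean isDecreasingOn(DoubleUnaryOperator f, double[] xs) {
    for (int i = 0; i + 1 < xs.length; i++) {
      if (f.applyAsDouble(xs[i]) < f.applyAsDouble(xs[i + 1])) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    double[] xs = { -2, -1, 0, 1, 2 };
    System.out.println(isIncreasingOn(x -> x * x * x, xs)); // the cube preserves order
    System.out.println(isIncreasingOn(x -> x * x, xs));     // the square does not
  }
}
```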



27.7.2 Order Relations 



All of us have learned the meaning and the importance of order since our earliest years in
school. The alphabet is ordered, the natural numbers are ordered, the marks on our school
reports are ordered, and so on. As a matter of fact, we come into contact with orders even
before entering school by learning to distinguish things according to their size, for instance.

Order relations 28 are another type of binary relations which are used to express the order
amongst the elements of a set A. Since order relations are imposed on single sets, both their
domain and their codomain are the same (A, in this case). For such relations, we can define
an additional number of properties which can be used to characterize and distinguish the
different types of order relations:



1. Antisymmetry:

$R(a_1, a_2) \wedge R(a_2, a_1) \Rightarrow a_1 = a_2 \quad \forall a_1, a_2 \in A$ (27.59)

2. Asymmetry:

$R(a_1, a_2) \Rightarrow \neg R(a_2, a_1) \quad \forall a_1, a_2 \in A$ (27.60)

3. Reflexiveness:

$R(a, a) \quad \forall a \in A$ (27.61)

4. Irreflexiveness:

$\nexists a \in A : R(a, a)$ (27.62)

All order relations are transitive 29 , and either antisymmetric or asymmetric and either
reflexive or irreflexive:

Definition 27.30 (Partial Order). A binary relation R defines a (non-strict, reflexive)
partial order if and only if it is reflexive, antisymmetric, and transitive.

The $\leq$ and $\geq$ operators, for instance, represent non-strict partial orders on the set of
the complex numbers $\mathbb{C}$. Partial orders that correspond to the > and < comparators are
called strict. The Pareto dominance relation introduced in Definition 1.13 on page 31 is
another example for such a strict partial order.

Definition 27.31 (Strict Partial Order). A binary relation R defines a strict (or
irreflexive) partial order if it is irreflexive, asymmetric, and transitive.
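On a finite set, the defining properties of partial orders can be verified exhaustively. The following minimal sketch uses a BiPredicate to stand for R(a1, a2); the class name, the method names, and the divisibility example are assumptions made for illustration.

```java
import java.util.function.BiPredicate;

/** Illustrative sketch: order properties of a relation r on a finite set a. */
class OrderPropertiesSketch {
  static boolean isReflexive(int[] a, BiPredicate<Integer, Integer> r) {
    for (int x : a) if (!r.test(x, x)) return false;
    return true;
  }

  static boolean isIrreflexive(int[] a, BiPredicate<Integer, Integer> r) {
    for (int x : a) if (r.test(x, x)) return false;
    return true;
  }

  static boolean isAntisymmetric(int[] a, BiPredicate<Integer, Integer> r) {
    for (int x : a)
      for (int y : a)
        if (r.test(x, y) && r.test(y, x) && x != y) return false;
    return true;
  }

  static boolean isAsymmetric(int[] a, BiPredicate<Integer, Integer> r) {
    for (int x : a)
      for (int y : a)
        if (r.test(x, y) && r.test(y, x)) return false;
    return true;
  }

  static boolean isTransitive(int[] a, BiPredicate<Integer, Integer> r) {
    for (int x : a)
      for (int y : a)
        for (int z : a)
          if (r.test(x, y) && r.test(y, z) && !r.test(x, z)) return false;
    return true;
  }

  static boolean isPartialOrder(int[] a, BiPredicate<Integer, Integer> r) {
    return isReflexive(a, r) && isAntisymmetric(a, r) && isTransitive(a, r);
  }

  static boolean isStrictPartialOrder(int[] a, BiPredicate<Integer, Integer> r) {
    return isIrreflexive(a, r) && isAsymmetric(a, r) && isTransitive(a, r);
  }

  public static void main(String[] args) {
    int[] a = { 1, 2, 3, 4, 6, 12 };
    // "x divides y" is reflexive, antisymmetric, and transitive on a
    System.out.println(isPartialOrder(a, (x, y) -> y % x == 0));
    System.out.println(isStrictPartialOrder(a, (x, y) -> x < y));
  }
}
```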



26 http://en.wikipedia.org/wiki/Monotonic_function [accessed 2007-08-08]

27 Order relations are discussed in Section 27.7.2.

28 http://en.wikipedia.org/wiki/Order_relation [accessed 2007-07-03]

29 See Equation 27.55 on the facing page for the definition of transitivity.

Definition 27.32 (Total Order). A total order 30 (or linear order, simple order) R on the
set A is a partial order which is complete/total.

$R(a_1, a_2) \vee R(a_2, a_1) \quad \forall a_1, a_2 \in A$ (27.63)

The real numbers $\mathbb{R}$, for example, are totally ordered. On the set of complex
numbers $\mathbb{C}$, in contrast, comparing real and imaginary parts component-wise yields
only (strict or reflexive) partial, non-total orders, since two complex numbers may be
incomparable in the two dimensions.



27.7.3 Equivalence Relations 

Another important class of relations are equivalence relations 31 [2093, 2141], which are often
abbreviated with $\equiv$ or $\sim$, i.e., $a_1 \equiv a_2$ and $a_1 \sim a_2$ mean $R(a_1, a_2)$ for the
equivalence relation R imposed on the set A and $a_1, a_2 \in A$. Unlike order relations,
equivalence relations are symmetric, i.e.,

$$R(a_1, a_2) \Rightarrow R(a_2, a_1) \quad \forall a_1, a_2 \in A \tag{27.64}$$

Definition 27.33 (Equivalence Relation). The binary relation R defines an equivalence 
relation on the set A if and only if it is reflexive, symmetric, and transitive. 

Definition 27.34 (Equivalence Class). If an equivalence relation R is defined on a set A,
the subset $A' \subseteq A$ of A is an equivalence class 32 if and only if
$\forall a_1, a_2 \in A' \Rightarrow R(a_1, a_2)$ (i.e., $a_1 \sim a_2$).



30 http://en.wikipedia.org/wiki/Total_order [accessed 2007-07-03]

31 http://en.wikipedia.org/wiki/Equivalence_relation [accessed 2007-07-28]

32 http://en.wikipedia.org/wiki/Equivalence_class [accessed 2007-07-28]
Stochastic Theory and Statistics



In this chapter we give a rough introduction to stochastic theory 1 [1720, 1264, 1043, 1044],
which subsumes 

1. probability 2 theory 3 , the mathematical study of phenomena characterized by random- 
ness or uncertainty, and 

2. statistics 4 , the art of collecting, analyzing, interpreting, and presenting data. 

28.1 General Information 
28.1.1 Books 

Some books about (or including significant information about) Stochastic Theory and Statis- 
tics are: 

Kallenberg [1084]: Foundations of modern probability 
Renyi [1720]: Probability Theory 

Tijms [2041]: Understanding Probability: Chance Rules in Everyday Life 
Feller [649]: An Introduction to Probability Theory and Its Applications 
Kallenberg [1085]: Probabilistic symmetries and invariance principles 
Jaynes [1043, 1044]: Probability Theory: The Logic of Science 
Lawler [1264]: Introduction to Stochastic Processes 
Casella and Berger [350]: Statistical Inference
Lowry [1310]: Concepts and Applications of Inferential Statistics 
Lowry [1313]: VassarStats: Web Site for Statistical Computing 

Siegel and Castellan Jr. [1878]: Nonparametric Statistics for The Behavioral Sciences 

Sheskin [1866]: Handbook of Parametric and Nonparametric Statistical Procedures 

Bhattacharyya and Johnson [205]: Statistical Concepts and Methods 

Bortz, Lienert, and Boehnke [252]: Verteilungsfreie Methoden in der Biostatistik 

Polasek [1652]: Schließende Statistik - Einführung in die Schätz- und Testtheorie für Wirtschaftswissenschaftler

Edgington [619]: Randomization tests 

Harlow, Mulaik, and Steiger [898]: What If There Were No Significance Tests? 
Dallal [478]: The Little Handbook of Statistical Practice



1 http://en.wikipedia.org/wiki/Stochastic [accessed 2007-07-03]

2 http://en.wikipedia.org/wiki/Probability [accessed 2007-07-03] 

3 http://en.wikipedia.org/wiki/Probability_theory [accessed 2007-07-03] 

4 http://en.wikipedia.org/wiki/Statistics [accessed 2007-07-03] 
Heath [912]: An introduction to experimental design and statistics for biology

Kanji [1091]: 100 Statistical Tests 

Neyman and Pearson [1523]: Joint Statistical Papers 

Lindley and Scott [1290]: New Cambridge Statistical Tables 

Rice [1727]: Mathematical Statistics and Data Analysis 

Kay [1105]: Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory 
Box, Hunter, and Hunter [263]: Statistics for Experimenters: Design, Innovation, and Discovery

Fisher [682]: The design of experiments

Cox and Reid [460]: The Theory of the Design of Experiments

Fisher [684]: Statistical methods and scientific inference 

Fisher [680]: Statistical Methods for Research Workers 

Casella and Berger [351]: Statistical Inference

Robert and Casella [1744]: Monte Carlo Statistical Methods 

Liu [1294]: Monte Carlo Strategies in Scientific Computing 

Yates [2288]: The Design and Analysis of Factorial Experiments 

Snyder and Miller [1914]: Random Point Processes in Time and Space 

Devroye [556]: Non-Uniform Random Variate Generation

Poor [1668]: An Introduction to Signal Detection and Estimation 

Van Trees [2100]: Detection, Estimation, and Modulation Theory, Part I 

Simon [1882]: Optimal State Estimation: Kalman, H Infinity, and Nonlinear Approaches 

Kleinbaum, Kupper, and Muller [1150]: Applied regression analysis and other multivariate methods

Draper and Smith [595]: Applied regression analysis 

Fox [739]: Applied Regression Analysis, Linear Models, and Related Methods 

Banks [134]: Handbook of Simulation: Principles, Methodology, Advances, Applications, and Practice

Mackeown [1339]: Stochastic Simulation in Physics 
Osborne and Rubinstein [1587]: A Course in Game Theory 
Fudenberg and Tirole [752]: Game Theory 

Kindermann and Snell [1139]: Markov Random Fields and Their Applications 
Bennett [178]: The Collected Papers of R.A. Fisher 



28.2 Probability 

Probability theory is used to determine the likeliness of the occurrence of an event under 
ideal mathematical conditions. [1084, 1085] 

Definition 28.1 (Random Experiment). Random experiments can be repeated arbitrarily
often; their results cannot be predicted.

Definition 28.2 (Elementary Event). The possible outcomes of random situations are
called elementary events or samples $\omega$.

Definition 28.3 (Sample Space). The set of all possible outcomes (elementary events,
samples) of a random situation is the sample space $\Omega = \{\omega_i : i \in 1..N = |\Omega|\}$.

When throwing dice 5, for example, the sample space will be
$\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4, \omega_5, \omega_6\}$, where $\omega_i$ means that the number i was thrown.

Definition 28.4 (Random Event). A random event A is a subset of the sample space $\Omega$
($A \subseteq \Omega$). If $\omega \in A$ occurs, then A occurs too.

5 Throwing a die is discussed extensively as an example for stochastics in Section 28.6 on page 497.






Definition 28.5 (Certain Event). The certain event is the random event which will always
occur in each repetition of a random experiment. Therefore, it is equal to the whole sample
space $\Omega$.

Definition 28.6 (Impossible Event). The impossible event will never occur in any
repetition of a random experiment; it is defined as $\emptyset$.

Definition 28.7 (Conflicting Events). Two conflicting events $A_1$ and $A_2$ can never occur
together in a random experiment. Therefore, $A_1 \cap A_2 = \emptyset$.



28.2.1 Probability as defined by Bernoulli (1713)

In some idealized situations, like throwing ideal coins or ideal dice, all elementary events of
the sample space have the same probability (de Laplace [523]).

$$P(\omega) = \frac{1}{|\Omega|} \quad \forall \omega \in \Omega \tag{28.1}$$

Equation 28.1 is also called the Laplace-assumption. If it holds, the probability of an
event A can be defined as:

$$P(A) = \frac{\text{number of possible events in favor of } A}{\text{number of possible events}} = \frac{|A|}{|\Omega|} \tag{28.2}$$
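
As a small worked instance of the Laplace-assumption, the sketch below computes the probability of throwing an even number with one ideal die. The function name `laplace_probability` is an illustrative choice, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

def laplace_probability(event, sample_space):
    """P(A) = |A| / |Omega| under the Laplace assumption (Equation 28.2)."""
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

omega = {1, 2, 3, 4, 5, 6}                   # sample space of one ideal die
even = {w for w in omega if w % 2 == 0}      # event "an even number is thrown"
p_even = laplace_probability(even, omega)
```

Three of the six equally likely elementary events favor the event, so `p_even` is one half.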

For many random experiments of this type, we can use combinatorial 6 approaches in
order to determine the number of possible outcomes. Therefore, we briefly outline
the mathematical concepts of factorial numbers, combinations, and permutations. 7

Definition 28.8 (Factorial). The factorial 8 $n!$ of a number $n \in \mathbb{N}_0$ is the product of n
and all natural numbers smaller than it. It is a specialization of the Gamma function for
positive integer numbers, see Section 28.10.1 on page 532.

$$n! = \prod_{i=1}^{n} i \tag{28.3}$$

$$0! = 1 \tag{28.4}$$

In combinatorial mathematics 9, we often want to know in how many ways we can arrange
$n \in \mathbb{N}$ elements from a set $\Omega$ with $M = |\Omega| \geq n$ elements. We can distinguish between
combinations, where the order of the elements in the arrangement plays no role, and
permutations, where it is important. (a, b, c) and (c, b, a), for instance, denote the same combination
but different permutations of the elements {a, b, c}. We furthermore distinguish between
arrangements where each element of $\Omega$ can occur at most once (without repetition) and
arrangements where the same elements may occur multiple times (with repetition).

Combinations

The number of possible combinations 10 $C(M, n)$ of $n \in \mathbb{N}$ elements out of a set $\Omega$ with
$M = |\Omega| \geq n$ elements without repetition is

6 http://en.wikipedia.org/wiki/Combinatorics [accessed 2007-07-03] 

7 http://en.wikipedia.org/wiki/Combinations_and_permutations [accessed 2007-07-03]

8 http://en.wikipedia.org/wiki/Factorial [accessed 2007-07-03]

9 http://en.wikipedia.org/wiki/Combinatorics [accessed 2008-01-31]
10 http://en.wikipedia.org/wiki/Combination [accessed 2008-01-31] 
$$C(M, n) = \binom{M}{n} = \frac{M!}{n!\,(M-n)!} \tag{28.5}$$

$$\binom{M}{n} = \frac{M}{n} \cdot \frac{M-1}{n-1} \cdot \frac{M-2}{n-2} \cdots \frac{M-n+1}{1} \tag{28.6}$$

$$C(M+1, n) = C(M, n) + C(M, n-1) = \binom{M+1}{n} = \binom{M}{n} + \binom{M}{n-1} \tag{28.7}$$

If the elements of $\Omega$ may repeatedly occur in the arrangements, the number of possible
combinations becomes

$$\frac{(M+n-1)!}{n!\,(M-1)!} = \binom{M+n-1}{n} = \binom{M+n-1}{M-1} = C(M+n-1, n) = C(M+n-1, M-1) \tag{28.8}$$



Permutations 

The number of possible permutations 11 $\text{Perm}(M, n)$ of $n \in \mathbb{N}$ elements out of a set $\Omega$ with
$M = |\Omega| \geq n$ elements without repetition is

$$\text{Perm}(M, n) = (M)_n = \frac{M!}{(M-n)!} \tag{28.9}$$

If an element from $\Omega$ can occur more than once in the arrangements, the number of possible
permutations is

$$M^n \tag{28.10}$$
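
The four counting formulas can be evaluated directly with the Python standard library. This is a minimal sketch; the wrapper function names are illustrative, while `math.comb` and `math.perm` are standard functions available from Python 3.8 on.

```python
import math

def combinations_without_repetition(M, n):
    return math.comb(M, n)              # Equation 28.5

def combinations_with_repetition(M, n):
    return math.comb(M + n - 1, n)      # Equation 28.8

def permutations_without_repetition(M, n):
    return math.perm(M, n)              # Equation 28.9

def permutations_with_repetition(M, n):
    return M ** n                       # Equation 28.10

# choosing n = 2 elements out of M = 6 (for example, the faces of a die)
counts = (
    combinations_without_repetition(6, 2),
    combinations_with_repetition(6, 2),
    permutations_without_repetition(6, 2),
    permutations_with_repetition(6, 2),
)
```

For M = 6 and n = 2 the four counts are 15, 21, 30, and 36, in the order listed above.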
28.2.2 The Limiting Frequency Theory of von Mises 

If we repeat a random experiment multiple times, the number of occurrences of a certain
event should somehow reflect its probability. The more often we perform the experiment,
the more reliable the estimates of the event's probability become. We can express this
relation using the notion of frequency.

Definition 28.9 (Absolute Frequency). The number H(A,n) denoting how often an 
event A occurred during n repetitions of a random experiment is its absolute frequency 12 . 

Definition 28.10 (Relative Frequency). The relative frequency $h(A, n)$ of an event A
is its absolute frequency normalized to the total number of experiments n. The relative
frequency has the following properties:

$$h(A, n) = \frac{H(A, n)}{n} \tag{28.11}$$

$$0 \leq h(A, n) \leq 1 \tag{28.12}$$

$$h(\Omega, n) = 1 \quad \forall n \in \mathbb{N} \tag{28.13}$$

$$A \cap B = \emptyset \Rightarrow h(A \cup B, n) = \frac{H(A, n) + H(B, n)}{n} = h(A, n) + h(B, n) \tag{28.14}$$

According to von Mises [2120], the (statistical) probability $P(A)$ of an event A is the limit
of its relative frequency $h(A, n)$ as n approaches infinity. This is the limit of the
quotient of the number of elementary events favoring A and the number of all possible
elementary events for infinitely many repetitions. [2120, 2121]

$$P(A) = \lim_{n \to \infty} h(A, n) = \lim_{n \to \infty} \frac{H(A, n)}{n} \tag{28.15}$$
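
The limit in Equation 28.15 can be illustrated, though of course not proven, by simulation. The sketch below throws a simulated die n times and reports the relative frequency of the event "even number"; the function name `relative_frequency`, the fixed seed, and the chosen values of n are assumptions made only for this illustration.

```python
import random

def relative_frequency(event, n, rng):
    """h(A, n): share of n simulated die throws whose outcome lies in event."""
    hits = sum(1 for _ in range(n) if rng.randint(1, 6) in event)  # H(A, n)
    return hits / n

rng = random.Random(42)      # fixed seed, only to make the run repeatable
even = {2, 4, 6}
estimates = {n: relative_frequency(even, n, rng) for n in (10, 1000, 100000)}
```

With growing n the values in `estimates` typically move closer to the Laplace probability of one half, which is the behavior the limiting frequency theory describes.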

11 http://en.wikipedia.org/wiki/Permutations [accessed 2008-01-31] 

12 http://en.wikipedia.org/wiki/Frequency_%28statistics%29 [accessed 2007-07-03]
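28.2.3 The Axioms of Kolmogorov

Definition 28.11 ($\sigma$-algebra). A subset S of the power set $\mathcal{P}(\Omega)$ is called $\sigma$-algebra 13 if
the following axioms hold:

$$\Omega \in S \tag{28.16}$$

$$\emptyset \in S \tag{28.17}$$

$$A \in S \Rightarrow \overline{A} \in S \tag{28.18}$$

$$A \in S \land B \in S \Rightarrow (A \cup B) \in S \tag{28.19}$$

From these axioms others can be deduced, for example:

$$A \in S \land B \in S \Rightarrow \overline{A} \in S \land \overline{B} \in S \tag{28.20}$$

$$\Rightarrow \overline{A} \cup \overline{B} \in S \Rightarrow \overline{\overline{A} \cup \overline{B}} = A \cap B \in S \tag{28.21}$$

$$A \in S \land B \in S \Rightarrow (A \cap B) \in S \tag{28.22}$$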

Definition 28.12 (Probability Space). A probability space (or random experiment) is
defined by the triple $(\Omega, S, P)$ where

1. $\Omega$ is the sample space, a set of elementary events,

2. S is a $\sigma$-algebra defined on $\Omega$, and

3. P defines a probability measure 14 that determines the probability of occurrence for
each event $\omega \in \Omega$. (Kolmogorov [1169] axioms 15)

Definition 28.13 (Probability). A mapping P which maps a real number to each
elementary event $\omega \in \Omega$ is called probability measure if and only if the following holds for
the $\sigma$-algebra S on $\Omega$:

$$\forall A \in S \Rightarrow 0 \leq P(A) \leq 1 \tag{28.23}$$

$$P(\Omega) = 1 \tag{28.24}$$

$$\forall \text{ disjoint } A_i \in S \Rightarrow P(A) = P\left(\bigcup_{\forall i} A_i\right) = \sum_{\forall i} P(A_i) \tag{28.25}$$

From these axioms, it can be deduced that:

$$P(\emptyset) = 0 \tag{28.26}$$

$$P(A) = 1 - P(\overline{A}) \tag{28.27}$$

$$P(A \cap \overline{B}) = P(A) - P(A \cap B) \tag{28.28}$$

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{28.29}$$
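
The deduced rules can be verified on a concrete probability space. The following sketch assumes one ideal die with the Laplace measure; the names `OMEGA`, `P`, and the two example events are chosen purely for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))               # one ideal die

def P(event):
    """Probability measure on subsets of OMEGA under the Laplace assumption."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({1, 2, 3})
B = frozenset({3, 4})
complement_A = OMEGA - A

rule_26 = P(frozenset()) == 0
rule_27 = P(A) == 1 - P(complement_A)
rule_28 = P(A - B) == P(A) - P(A & B)        # A - B is A intersected with not-B
rule_29 = P(A | B) == P(A) + P(B) - P(A & B)
```

All four flags evaluate to `True` for these events; other subsets of `OMEGA` can be substituted for `A` and `B` to repeat the check.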



28.2.4 Conditional Probability 

Definition 28.14 (Conditional Probability). The conditional probability 16 $P(A|B)$ is
the probability of some event A, given the occurrence of some other event B. $P(A|B)$ is
read "the probability of A, given B".

13 http://en.wikipedia.org/wiki/Sigma-algebra [accessed 2007-07-03] 

14 http://en.wikipedia.org/wiki/Probability_measure [accessed 2007-07-03] 

15 http://en.wikipedia.org/wiki/Kolmogorov_axioms [accessed 2007-07-03] 

16 http://en.wikipedia.org/wiki/Conditional_probability [accessed 2007-07-03] 
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \tag{28.30}$$

$$P(A \cap B) = P(A|B)\,P(B) \tag{28.31}$$

Definition 28.15 (Statistical Independence). Two events A and B are (statistically)
independent if and only if $P(A \cap B) = P(A)\,P(B)$ holds. If two events A and B are
statistically independent, we can deduce:

$$P(A \cap B) = P(A)\,P(B) \tag{28.32}$$

$$P(A|B) = P(A) \tag{28.33}$$

$$P(B|A) = P(B) \tag{28.34}$$
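
Conditional probability and independence can be made concrete with two ideal dice. This is a sketch under the Laplace-assumption; the names `P_given`, `first_is_six`, and the other events are hypothetical and only serve the example.

```python
from fractions import Fraction
from itertools import product

OMEGA = set(product(range(1, 7), repeat=2))      # all outcomes of two dice

def P(event):
    return Fraction(len(event), len(OMEGA))

def P_given(a, b):
    """P(A|B) = P(A and B) / P(B), Equation 28.30; requires P(B) > 0."""
    return P(a & b) / P(b)

first_is_six = {w for w in OMEGA if w[0] == 6}
second_is_six = {w for w in OMEGA if w[1] == 6}
sum_is_twelve = {w for w in OMEGA if sum(w) == 12}

# the two dice do not influence each other (Equation 28.32)
independent = P(first_is_six & second_is_six) == P(first_is_six) * P(second_is_six)
# knowing that the first die shows a six raises the chance of a total of twelve
p_twelve_given_first_six = P_given(sum_is_twelve, first_is_six)
```

The unconditional probability of a total of twelve is 1/36, while the conditional probability given a six on the first die is 1/6, so these two events are not independent.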

28.2.5 Random Variable 

Definition 28.16 (Random Variable). The function X which relates the sample space
$\Omega$ to the real numbers $\mathbb{R}$ is called random variable 17 in the probability space $(\Omega, S, P)$.

$$X : \Omega \mapsto \mathbb{R} \tag{28.35}$$

Using such a random variable, we can replace the sample space $\Omega$ with the new sample
space $\Omega_X$. Furthermore, the $\sigma$-algebra S can be replaced with a $\sigma$-algebra $S_X$, which consists
of subsets of $\Omega_X$ instead of $\Omega$. Last but not least, we replace the probability measure P which
relates the $\omega \in \Omega$ to the interval [0, 1] by a new probability measure $P_X$ which relates the
real numbers $\mathbb{R}$ to this interval.

Definition 28.17 (Probability Space of a Random Variable). If $X : \Omega \mapsto \mathbb{R}$ is a random
variable, then the probability space of X is defined as the triplet

$$(\Omega_X, S_X, P_X) \tag{28.36}$$

One example for such a new probability measure would be the probability that a random
variable X takes on a real value which is smaller than or equal to a value x:

$$P_X(X \leq x) = P(\{\omega : \omega \in \Omega \land X(\omega) \leq x\}) \tag{28.37}$$



28.2.6 Cumulative Distribution Function 

Definition 28.18 (Cumulative Distribution Function). If X is a random variable of
a probability space $(\Omega_X = \mathbb{R}, S_X, P_X)$, we call the function $F_X : \mathbb{R} \mapsto [0, 1]$ the (cumulative)
distribution function 18 (CDF) of the random variable X if it fulfills Equation 28.38.

$$F_X(x) := \underbrace{P_X(X \leq x)}_{\text{definition rnd. var.}} = \underbrace{P(\{\omega : \omega \in \Omega \land X(\omega) \leq x\})}_{\text{definition probability space}} \tag{28.38}$$

A cumulative distribution function $F_X$ has the following properties:
1. $F_X$ is normalized:

$$\underbrace{\lim_{x \to -\infty} F_X(x) = 0}_{\text{impossible event}}, \quad \underbrace{\lim_{x \to +\infty} F_X(x) = 1}_{\text{certain event}} \tag{28.39}$$

17 http://en.wikipedia.org/wiki/Random_variable [accessed 2007-07-03] 

18 http://en.wikipedia.org/wiki/Cumulative_distribution_function [accessed 2007-07-03]
2. $F_X$ is monotonically 19 growing:

$$F_X(x_1) \leq F_X(x_2) \quad \forall x_1 \leq x_2 \tag{28.40}$$

3. $F_X$ is (right-sided) continuous 20:

$$\lim_{h \to 0^{+}} F_X(x + h) = F_X(x) \tag{28.41}$$

4. The probability that the random variable X takes on values in the interval $x_0 < X \leq x_1$
can be computed using the CDF:

$$P(x_0 < X \leq x_1) = F_X(x_1) - F_X(x_0) \tag{28.42}$$

5. The probability that the random variable X takes on the value of a single random
number x:

$$P(X = x) = F_X(x) - \lim_{h \to 0^{+}} F_X(x - h) \tag{28.43}$$

We can further distinguish between sample spaces $\Omega$ which contain at most countably
infinitely many elements and those that are continuums. Hence, there are discrete 21 and
continuous 22 random variables.

Definition 28.19 (Discrete Random Variable). A random variable X (and its
probability measure $P_X$, respectively) is called discrete if it takes on at most countably infinitely
many values. Its cumulative distribution function $F_X$ therefore has the shape of a
stairway.

Definition 28.20 (Continuous Random Variable). A random variable X (and its probability
measure $P_X$, respectively) is called continuous if it can take on uncountably infinitely many
values and its cumulative distribution function $F_X$ is also continuous.

28.2.7 Probability Mass Function 

Definition 28.21 (Probability Mass Function). The probability mass function 23
(PMF) $f_X$ is defined for discrete distributions only and assigns a probability to each value
a discrete random variable X can take on.

$$f_X : \mathbb{Z} \mapsto [0, 1] : f_X(x) := P_X(X = x) \tag{28.44}$$

Therefore, we can specify the relation between the PMF and its corresponding (discrete)
CDF in Equation 28.45 and Equation 28.46. We can further define the probability of an
event A in Equation 28.47 using the PMF.

$$P_X(X \leq x) = F_X(x) = \sum_{i=-\infty}^{x} f_X(i) \tag{28.45}$$

$$P_X(X = x) = f_X(x) = F_X(x) - F_X(x - 1) \tag{28.46}$$

$$P_X(A) = \sum_{\forall x \in A} f_X(x) \tag{28.47}$$
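
The relation between PMF and CDF can be traced for the ideal die. The sketch below is one possible encoding; `pmf_die`, `cdf_die`, and `prob_event` are illustrative names, and the summation starts at 1 because the PMF is zero below that value.

```python
from fractions import Fraction

def pmf_die(x):
    """f_X(x) for one ideal die: 1/6 on 1..6, zero elsewhere."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)

def cdf_die(x):
    """F_X(x) as the sum of f_X(i) for all integers i <= x (Equation 28.45)."""
    upper = min(int(x), 6)
    return sum((pmf_die(i) for i in range(1, upper + 1)), Fraction(0))

def prob_event(event):
    """P_X(A) as the sum of f_X(x) over all x in A (Equation 28.47)."""
    return sum((pmf_die(x) for x in event), Fraction(0))
```

For example, `cdf_die(3)` is one half, and the difference `cdf_die(4) - cdf_die(3)` recovers `pmf_die(4)` as stated in Equation 28.46.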



19 http://en.wikipedia.org/wiki/Monotonicity [accessed 2007-07-03]

20 http://en.wikipedia.org/wiki/Continuous_function [accessed 2007-07-03]

21 http://en.wikipedia.org/wiki/Discrete_random_variable [accessed 2007-07-03]

22 http://en.wikipedia.org/wiki/Continuous_probability_distribution [accessed 2007-07-03]

23 http://en.wikipedia.org/wiki/Probability_mass_function [accessed 2007-07-03]
28.2.8 Probability Density Function

The probability density function 24 (PDF) is the counterpart of the PMF for continuous
distributions. The PDF does not represent the probabilities of the single values of a random
variable. Since a continuous random variable can take on uncountably many values, each
distinct value itself has the probability 0. If we, for instance, picture the current temperature
outside as a (continuous) random variable, the probability that it takes on the value 18 for
18°C is zero. It will never be exactly 18°C outside; we can at most declare with a certain
probability that we have a temperature between 17.99999°C and 18.00001°C.

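Definition 28.22 (Probability Density Function). If a random variable X is continuous,
its probability density function $f_X$ is defined as

$$P(a \leq X \leq b) = \int_{a}^{b} f_X(x)\,dx \quad \forall a, b \in \mathbb{R},\ a \leq b \tag{28.48}$$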



28.3 Stochastic Properties 

Each random variable X which conforms to a probability distribution Fx may have certain 
properties such as a maximum and a mean value, a variance, and a value which will be taken 
on by X most often. If the cumulative distribution function Fx of X is known, these values 
can usually be computed directly from its parameters. On the other hand, it is possible that 
we only know the values A[i] which X took on during some random experiments. From this 
set of sample data A, we can estimate the properties of the underlying (possibly unknown) 
distribution of X using statistical methods (with a certain error, of course). 

In the following, we will elaborate on the properties of a random variable $X \in \mathbb{R}$ both
from the viewpoint of knowing the PMF/PDF $f_X(x)$ and the CDF $F_X(x)$ as well as from
the statistical perspective, where only a sample A of past values of X is known. In the
latter case, we define the sample as a list A with the length $n = \text{len}(A)$ and the elements
$A[i] : i \in [0, n-1]$.

28.3.1 Count, Min, Max and Range 

The most primitive features of a random distribution are the minimum, maximum, and the
range of its values, as well as the number of values $A[i]$ in a data sample A.

Definition 28.23 (Count). $n = \text{len}(A)$ is the number of elements in the data sample A.

This item count is only defined for data samples, not for random variables, since random
variables represent experiments which can be repeated infinitely often and thus stand for
infinitely many values. The number of items should not be mixed up with the number of
different values the random variable may take on. A data sample A may contain the same value
a multiple times. When throwing a die seven times, one may throw A = (1, 4, 3, 3, 2, 6, 1),
for example 25.

Definition 28.24 (Minimum). In statistics, there exists no element in the sample data A
that is smaller than the minimum $\check{a} = \min(A)$. From the perspective
of the cumulative distribution function $F_X$, the minimum is the lower boundary $\check{x}$ of the
random variable X (or negative infinity, if no such boundary exists). Both definitions are
fully compliant with Definition 1.10 on page 25.



24 http://en.wikipedia.org/wiki/Probability_density_function [accessed 2007-07-03]

25 Throwing a die is discussed extensively as an example for stochastics in Section 28.6 on page 497.




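$$\min(A) \equiv \check{a} \in A : \forall a \in A \Rightarrow \check{a} \leq a \tag{28.49}$$

$$\check{x} = \min(X) \Leftrightarrow F_X(x) = 0 \;\; \forall x < \check{x} \;\land\; F_X(x) > 0 \;\; \forall x > \check{x} \tag{28.50}$$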
Definition 28.25 (Maximum). In statistically evaluated sample data, there exists no element
bigger than the maximum $\hat{a} = \max(A)$ in A. The value $\hat{x}$ is the upper boundary of the values
a random variable X may take on (or positive infinity, if X is unbounded). This definition
is compliant with Definition 1.9 on page 25.

$$\max(A) \equiv \hat{a} \in A : \forall a \in A \Rightarrow \hat{a} \geq a \tag{28.51}$$

$$\hat{x} = \max(X) \Leftrightarrow F_X(\hat{x}) \geq F_X(x) \quad \forall x \in \mathbb{R} \tag{28.52}$$

Definition 28.26 (Range). The range $\text{range}(A)$ of the sample data A is the difference of the
maximum $\max(A)$ and the minimum $\min(A)$ of A and therefore represents the span covered
by the data. If a random variable X is limited in both directions, it has a finite range
$\text{range}(X)$; otherwise this range is infinite too.

$$\text{range}(A) = \hat{a} - \check{a} = \max(A) - \min(A) \tag{28.53}$$

$$\text{range}(X) = \hat{x} - \check{x} = \max(X) - \min(X) \tag{28.54}$$
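
For a data sample, these four features map directly onto Python built-ins. The sketch uses the seven die throws from the text; the variable names are illustrative.

```python
A = [1, 4, 3, 3, 2, 6, 1]        # the seven die throws from the text

n = len(A)                       # count (Definition 28.23)
a_min = min(A)                   # minimum (Equation 28.49)
a_max = max(A)                   # maximum (Equation 28.51)
a_range = a_max - a_min          # range (Equation 28.53)
distinct_values = len(set(A))    # differs from n because values repeat
```

The sample has seven elements but only five distinct values, which is the difference between the item count and the number of different values pointed out above.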



28.3.2 Expected Value and Arithmetic Mean 

The expected value EX and the arithmetic mean $\overline{a}$ are basic measures for random variables
and data samples that help us to estimate the regions around which their values are distributed.

Definition 28.27 (Expected Value). The expected value 26 of a random variable X is
the sum of the probability of each possible outcome of the random experiment multiplied
by the outcome value. It is abbreviated by EX or $\mu$. For discrete distributions it can be
computed using Equation 28.55 and for continuous ones Equation 28.56 holds.

$$EX = \sum_{x=-\infty}^{\infty} x\, f_X(x) \tag{28.55}$$

$$EX = \int_{-\infty}^{\infty} x\, f_X(x)\,dx \tag{28.56}$$

If the expected value EX of a random variable X is known, the expected values of some
related random variables can be derived as follows:

$$Y = a + X \Rightarrow EY = a + EX \tag{28.57}$$

$$Z = bX \Rightarrow EZ = b\,EX \tag{28.58}$$

Definition 28.28 (Sum). The $\text{sum}(A)$ represents the sum of all elements in a set of data
samples A. This value does, of course, only exist in statistics.

$$\text{sum}(A) = \sum_{i=0}^{n-1} A[i] \tag{28.59}$$

Definition 28.29 (Arithmetic Mean). The arithmetic mean 27 $\overline{a}$ is the sum of all
elements in the sample data A divided by the total number of values. In the spirit of the
limiting frequency method of von Mises [2120], it is an estimation of the expected value
$\overline{a} \approx EX$ of the random variable X that produced the sample data A. In the last term of
Equation 28.60, the sum runs over the distinct values a occurring in A.

$$\overline{a} = \frac{\text{sum}(A)}{n} = \frac{1}{n} \sum_{i=0}^{n-1} A[i] = \sum_{\forall a \in A} a\, h(a, n) \tag{28.60}$$
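
The link between the expected value and the arithmetic mean can be seen in a short computation. The following sketch assumes a PMF stored as a dictionary from value to probability; the function names are illustrative, and exact fractions keep the comparison free of rounding.

```python
from collections import Counter
from fractions import Fraction

def expected_value(pmf):
    """EX = sum of x * f_X(x) over all x with non-zero probability (Eq. 28.55)."""
    return sum(x * p for x, p in pmf.items())

def arithmetic_mean(sample):
    """Sum of all elements divided by their number."""
    return Fraction(sum(sample), len(sample))

def mean_from_relative_frequencies(sample):
    """Sum over the distinct values a of a * h(a, n)."""
    n = len(sample)
    return sum(a * Fraction(count, n) for a, count in Counter(sample).items())

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
A = [1, 4, 3, 3, 2, 6, 1]
```

The ideal die has the expected value 7/2, whereas the seven throws in `A` have the arithmetic mean 20/7; both ways of computing the sample mean agree.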



26 http://en.wikipedia.org/wiki/Expected_value [accessed 2007-07-03] 

27 http://en.wikipedia.org/wiki/Arithmetic_mean [accessed 2007-07-03] 
28.3.3 Variance and Standard Deviation

The variance 28 [677] is a measure of statistical dispersion. It illustrates how close the results
of a random variable or the elements a in a data sample A are to their expected value EX
or their arithmetic mean $\overline{a}$.

Definition 28.30 (Variance of a Random Variable). The variance $D^2X = \text{var}(X) = \sigma^2$
of a random variable X is defined as

$$\text{var}(X) = D^2X = E\left[(X - EX)^2\right] = E\left[X^2\right] - (EX)^2 \tag{28.61}$$



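The variance of a discrete random variable X can be computed using Equation 28.62
and for continuous distributions, Equation 28.63 will hold.

$$D^2X = \sum_{x=-\infty}^{\infty} f_X(x)(x - EX)^2 = \sum_{x=-\infty}^{\infty} x^2 f_X(x) - \left[\sum_{x=-\infty}^{\infty} x\, f_X(x)\right]^2 = \left[\sum_{x=-\infty}^{\infty} x^2 f_X(x)\right] - (EX)^2 \tag{28.62}$$

$$D^2X = \int_{-\infty}^{\infty} (x - EX)^2 f_X(x)\,dx = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx - \left[\int_{-\infty}^{\infty} x\, f_X(x)\,dx\right]^2 = \left[\int_{-\infty}^{\infty} x^2 f_X(x)\,dx\right] - (EX)^2 \tag{28.63}$$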



If the variance $D^2X$ of a random variable X is known, we can derive the variances of
some related random variables as follows:

$$Y = a + X \Rightarrow D^2Y = D^2X \tag{28.64}$$

$$Z = bX \Rightarrow D^2Z = b^2 D^2X \tag{28.65}$$



Definition 28.31 (Sum of Squares). The function $\text{sumSqrs}(A)$ is only defined for
statistical data and represents the sum of the squares of all elements in the data sample A.

$$\text{sumSqrs}(A) = \sum_{i=0}^{n-1} (A[i])^2 \tag{28.66}$$



Definition 28.32 (Variance Estimator). We define the (unbiased) estimator 29 $s^2$ of
the variance of the random variable which produced the sample values A according to
Equation 28.67. The variance is zero for all samples with $n = \text{len}(A) \leq 1$.

$$s^2 = \frac{1}{n-1} \sum_{i=0}^{n-1} \left(A[i] - \overline{a}\right)^2 = \frac{1}{n-1} \left(\text{sumSqrs}(A) - \frac{(\text{sum}(A))^2}{n}\right) \tag{28.67}$$



Definition 28.33 (Standard Deviation). The standard deviation 30 is the square root of
the variance. The standard deviation of a random variable X is abbreviated with DX and
$\sigma$, its statistical estimate is s.

$$DX = \sqrt{D^2X} \tag{28.68}$$

$$s = \sqrt{s^2} \tag{28.69}$$

The standard deviation is zero for all samples with $n \leq 1$.
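
The estimator can be computed from `sum(A)` and `sumSqrs(A)` in a single pass over the data. The sketch below follows the second form of Equation 28.67; the function names are illustrative, and the comparison with `statistics.variance` from the Python standard library is only a cross-check.

```python
import math
import statistics

def variance_estimate(sample):
    """Unbiased estimator s^2 (Equation 28.67); zero for fewer than two values."""
    n = len(sample)
    if n <= 1:
        return 0.0
    total = sum(sample)                       # sum(A)
    sum_sqrs = sum(a * a for a in sample)     # sumSqrs(A)
    return (sum_sqrs - total * total / n) / (n - 1)

def standard_deviation_estimate(sample):
    return math.sqrt(variance_estimate(sample))

A = [1, 4, 3, 3, 2, 6, 1]
s2 = variance_estimate(A)
s = standard_deviation_estimate(A)
reference = statistics.variance(A)            # same estimator, library version
```

For the seven die throws, `s2` equals 22/7 up to rounding. For floating-point data with a large mean and a small spread, the sum-of-squares form may lose precision through cancellation, in which case the first form of Equation 28.67 is the safer choice.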



28 http://en.wikipedia.org/wiki/Variance [accessed 2007-07-03]

29 see Definition 28.55 on page 499

30 http://en.wikipedia.org/wiki/Standard_deviation [accessed 2007-07-03]
Definition 28.34 (Coefficient of Variation). The coefficient of variation 31 $c_V$ of a random
variable X is the ratio of the standard deviation to the expected value of X. For data
samples, its estimate $\tilde{c}_V$ is defined as the ratio of the estimate of the standard deviation
and the arithmetic mean.

$$c_V = \frac{DX}{EX} = \frac{\sigma}{\mu} \tag{28.70}$$

$$\tilde{c}_V = \frac{s}{\overline{a}} = \frac{n}{\text{sum}(A)} \sqrt{\frac{\text{sumSqrs}(A) - \frac{(\text{sum}(A))^2}{n}}{n-1}} \tag{28.71}$$

Definition 28.35 (Covariance). The covariance 32 $\text{cov}(X, Y)$ of two random variables X
and Y is a measure for how much they are related. It exists if the expected values $EX^2$ and
$EY^2$ exist and is defined as

$$\text{cov}(X, Y) = E\left[(X - EX)(Y - EY)\right] \tag{28.72}$$

$$\text{cov}(X, Y) = E[X \cdot Y] - EX \cdot EY \tag{28.73}$$

If X and Y are statistically independent, then their covariance is zero, since

$$E[X \cdot Y] = EX \cdot EY \tag{28.75}$$
Furthermore, the following formulas hold for the covariance

$$D^2X = \text{cov}(X, X) \tag{28.76}$$

$$D^2[X + Y] = \text{cov}(X + Y, X + Y) = D^2X + D^2Y + 2\,\text{cov}(X, Y) \tag{28.77}$$

$$D^2[X - Y] = \text{cov}(X - Y, X - Y) = D^2X + D^2Y - 2\,\text{cov}(X, Y) \tag{28.78}$$

$$\text{cov}(X, Y) = \text{cov}(Y, X) \tag{28.79}$$

$$\text{cov}(aX, Y) = a\,\text{cov}(Y, X) \tag{28.80}$$

$$\text{cov}(X + Y, Z) = \text{cov}(X, Z) + \text{cov}(Y, Z) \tag{28.81}$$

$$\text{cov}(aX + b, cY + d) = ac\,\text{cov}(X, Y) \tag{28.82}$$
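
These identities also hold exactly for the empirical distribution of paired observations, which allows a quick numerical check. The sketch assumes that every listed pair is equally likely; the helper names `mean`, `cov`, and `var` and the two small data lists are invented for the illustration.

```python
from fractions import Fraction

def mean(values):
    return Fraction(sum(values), len(values))

def cov(xs, ys):
    """cov(X, Y) = E[(X - EX)(Y - EY)] over equally likely paired observations."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def var(xs):
    return cov(xs, xs)                           # Equation 28.76

X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 6]
S = [x + y for x, y in zip(X, Y)]                # observations of X + Y
D = [x - y for x, y in zip(X, Y)]                # observations of X - Y

sum_rule = var(S) == var(X) + var(Y) + 2 * cov(X, Y)      # Equation 28.77
diff_rule = var(D) == var(X) + var(Y) - 2 * cov(X, Y)     # Equation 28.78
```

Both flags evaluate to `True`. Note that `cov` here divides by n, since it describes the distribution of the listed pairs themselves rather than an unbiased estimate for a larger population.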



28.3.4 Moments 

Definition 28.36 (Moment). The $k$th moment 33 $\mu'_k(c)$ about a value c is defined for a
random distribution X as

$$\mu'_k(c) = E\left[(X - c)^k\right] \tag{28.83}$$

It can be specified for discrete (Equation 28.84) and continuous (Equation 28.85) probability
distributions using Equation 28.55 and Equation 28.56 as follows.

$$\mu'_k(c) = \sum_{x=-\infty}^{\infty} f_X(x)(x - c)^k \tag{28.84}$$

$$\mu'_k(c) = \int_{-\infty}^{\infty} f_X(x)(x - c)^k\,dx \tag{28.85}$$

Definition 28.37 (Statistical Moment). The $k$th statistical moment $\mu'_k$ of a random
distribution is its $k$th moment about zero, i.e., the expected value of its values raised to the
$k$th power.

$$\mu'_k = \mu'_k(0) \tag{28.86}$$

Definition 28.38 (Central Moment). The k th moment about the mean (or central mo- 
ment) 34 is the expected value of the difference between elements and their expected value 



31 http://en.wikipedia.org/wiki/Coefficient_of_variation [accessed 2007-07-03]
32 http://en.wikipedia.org/wiki/Covariance [accessed 2008-02-05]
33 http://en.wikipedia.org/wiki/Moment_%28mathematics%29 [accessed 2008-02-01]
34 http://en.wikipedia.org/wiki/Moment_about_the_mean [accessed 2007-07-03]
raised to the $k$th power.

$$\mu_k = E\left[(X - EX)^k\right] \tag{28.87}$$

Hence, the variance $D^2X$ equals the second central moment $\mu_2$.

Definition 28.39 (Standardized Moment). The $k$th standardized moment $\mu_{\sigma,k}$ is the
quotient of the $k$th central moment and the standard deviation raised to the $k$th power.

$$\mu_{\sigma,k} = \frac{\mu_k}{\sigma^k} \tag{28.88}$$



28.3.5 Skewness and Kurtosis 

The two other most important moments of random distributions are the skewness $\gamma_1$ and
the kurtosis $\gamma_2$ and their estimates $G_1$ and $G_2$.

Definition 28.40 (Skewness). The skewness 35 $\gamma_1$, the third standardized moment, is
a measure of asymmetry of a probability distribution. If $\gamma_1 > 0$, the right part of the
distribution function is either longer or fatter (positive skew, right-skewed). If $\gamma_1 < 0$, the
distribution's left part is longer or fatter.

$$\gamma_1 = \mu_{\sigma,3} = \frac{\mu_3}{\sigma^3} \tag{28.89}$$



For sample data A the skewness of the underlying random variable is approximated with
the estimator $G_1$ where s is the estimated standard deviation. The sample skewness is only
defined for sets A with at least three elements.

$$G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=0}^{n-1} \left(\frac{A[i] - \overline{a}}{s}\right)^3 \tag{28.90}$$



Definition 28.41 (Kurtosis). The excess kurtosis 36 $\gamma_2$ is a measure for the sharpness
of a distribution's peak. A distribution with a high kurtosis has a sharper "peak" and
fatter "tails", while a distribution with a low kurtosis has a more rounded peak with wider
"shoulders". The normal distribution (see Section 28.5.2) has an excess kurtosis of zero.

$$\gamma_2 = \mu_{\sigma,4} - 3 = \frac{\mu_4}{\sigma^4} - 3 \tag{28.91}$$



For sample data A which represents only a subset of a greater amount of data, the sample
kurtosis can be approximated with the estimator $G_2$ where s is the estimate of the sample's
standard deviation. The kurtosis is only defined for sets with at least four elements.

$$G_2 = \left[\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=0}^{n-1} \left(\frac{A[i] - \overline{a}}{s}\right)^4\right] - \frac{3(n-1)^2}{(n-2)(n-3)} \tag{28.92}$$
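
Both estimators share the sum of standardized deviations raised to a power, which suggests a common helper. The following is a minimal sketch of Equations 28.90 and 28.92; the function names are illustrative, and a sample whose values are all equal is not handled (its estimated standard deviation is zero, so the division fails).

```python
import math

def _standardized_power_sum(sample, k):
    """Sum of ((A[i] - mean) / s)^k, with s from the unbiased variance estimate."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((a - mean) ** 2 for a in sample) / (n - 1))
    return sum(((a - mean) / s) ** k for a in sample)

def sample_skewness(sample):
    """Estimator G_1 (Equation 28.90); needs at least three elements."""
    n = len(sample)
    if n < 3:
        raise ValueError("G_1 needs at least three elements")
    return n / ((n - 1) * (n - 2)) * _standardized_power_sum(sample, 3)

def sample_excess_kurtosis(sample):
    """Estimator G_2 (Equation 28.92); needs at least four elements."""
    n = len(sample)
    if n < 4:
        raise ValueError("G_2 needs at least four elements")
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * _standardized_power_sum(sample, 4) - correction
```

A symmetric sample such as (1, 2, 3, 4, 5) has a skewness estimate of zero and an excess kurtosis estimate of -1.2, while a sample with one large outlier on the right, such as (1, 1, 1, 1, 10), yields a positive skewness.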



28.3.6 Median, Quantiles, and Mode 

Definition 28.42 (Median). The median $m = \text{med}(X)$ is the value right in the middle
of a sample or distribution, dividing it into two equal halves. Therefore, the probability of
drawing an element less than m is equal to the probability of drawing an element
larger than m.

$$P(X \leq m) \geq \frac{1}{2} \;\land\; P(X \geq m) \geq \frac{1}{2} \;\land\; P(X \leq m) \leq P(X \geq m) \tag{28.93}$$

35 http://en.wikipedia.org/wiki/Skewness [accessed 2007-07-03] 

36 http://en.wikipedia.org/wiki/Kurtosis [accessed 2008-02-01]
We can determine the median m of continuous and discrete distributions by solving
Equation 28.94 and Equation 28.95 respectively.

$$\frac{1}{2} = \int_{-\infty}^{m} f_X(x)\,dx \tag{28.94}$$

$$\sum_{i=-\infty}^{m-1} f_X(i) \leq \frac{1}{2} \leq \sum_{i=m}^{\infty} f_X(i) \tag{28.95}$$

If a sample A has an odd element count, the median m is the element in the middle;
otherwise (in a set with an even element count there exists no single "middle" element),
it is the arithmetic mean of the two middle elements. The median is robust against outliers.
If you have, for example, the dataset A = (1, 1, 1, 1, 1, 2, 2, 2, 500 000), the
arithmetic mean, biased by the large element 500 000, would be very high (about 55556.8). The
median, however, would be 1 and thus represents the sample better. The median of a sample
can be computed as:

$$A_s = \text{sortList}_a(A, >) \tag{28.97}$$

$$\text{med}(A) = \begin{cases} A_s\left[\frac{n-1}{2}\right] & \text{if } n = \text{len}(A) \text{ is odd} \\ \frac{1}{2}\left(A_s\left[\frac{n}{2}\right] + A_s\left[\frac{n}{2} - 1\right]\right) & \text{otherwise} \end{cases} \tag{28.98}$$
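
The two cases of Equation 28.98 translate into a few lines of code. This sketch uses zero-based indices as in the text; the function name `median` is illustrative, and `sorted` stands in for the ascending sort of Equation 28.97.

```python
def median(sample):
    """Median of a data sample following Equations 28.97 and 28.98."""
    if not sample:
        raise ValueError("median of an empty sample is undefined")
    a_s = sorted(sample)                 # ascending order
    n = len(a_s)
    if n % 2 == 1:
        return a_s[(n - 1) // 2]         # single middle element
    return (a_s[n // 2] + a_s[n // 2 - 1]) / 2   # mean of the two middle elements

A = [1, 1, 1, 1, 1, 2, 2, 2, 500000]
m = median(A)
mean = sum(A) / len(A)
```

For the dataset from the text, `m` is 1 while `mean` is roughly 55556.8, which shows how strongly a single outlier pulls the arithmetic mean.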

Definition 28.43 (Quantile). Quantiles 37 are points taken at regular intervals from a
sorted dataset (or a cumulative distribution function). The q-quantiles divide a distribution
function $F_X$ or data sample A into q parts $T_i$ with equal probability. They can be regarded
as the generalized median, or vice versa, the median is the 2-quantile.

$$\forall x \in \mathbb{R}, i \in [0, q - 1] \Rightarrow \frac{1}{q} \leq P(x \in T_i) \tag{28.99}$$

A sorted data sample is divided into q subsets of equal length by the q-quantiles. The
cumulative distribution function of a random variable X is divided by the q-quantiles into q
subsets of equal area. The quantiles are the boundaries between the subsets/areas. Therefore,
the $k$th q-quantile is the value $\zeta$ so that the probability that the random variable (or an
element of the data set) will take on a value less than $\zeta$ is at most $\frac{k}{q}$ and the probability
that it will take on a value greater than $\zeta$ is at most $\frac{q-k}{q}$. There exist $q - 1$
q-quantiles (k spans from 1 to $q - 1$). The $k$th q-quantile $\text{quantile}^k_q(A)$ of a dataset A can be
computed as:

$$A_s = \text{sortList}_a(A, >) \tag{28.100}$$

$$\text{quantile}^k_q(A) = A_s\left[\left\lfloor \frac{k\,n}{q} \right\rfloor\right] \tag{28.101}$$

For some special values of q, the quantiles have been given special names too (see Ta- 
ble 28.1). 

Definition 28.44 (Interquartile Range). The interquartile range 38 is the range between
the first and the third quartile and is defined as $\text{quantile}^3_4(X) - \text{quantile}^1_4(X)$.

Definition 28.45 (Mode). The mode 39 is the value that most often occurs in a data sam- 
ple or is most frequently assumed by a random variable. There exist unimodal distribution- 
s/samples that have one mode value and multimodal distributions/samples with multiple 
modes. 

In [2119, 258] you can find further information of the relation between the mode, the mean 
and the skewness. 



37 http://en.wikipedia.org/wiki/Quantiles [accessed 2007-07-03] 

38 http://en.wikipedia.org/wiki/Inter-quartile_range [accessed 2007-07-03] 

39 http://en.wikipedia.org/wiki/Mode_%28statistics%29 [accessed 2007-07-03]

q      name

100    percentiles
10     deciles
9      noniles
5      quintiles
4      quartiles
2      median

Table 28.1: Special Quantiles

28.3.7 Entropy 

Definition 28.46 (Information Entropy). The information entropy 40 H(X) defined by
Shannon [1858] is a measure of uncertainty for discrete probability mass functions $f_X$ of
random variables X or data sets A. It is defined as follows. The h(a, n) in Equation 28.103
denotes the relative frequency of the value a amongst the n samples in A.

$$H(X) = \sum_{x=-\infty}^{\infty} f_X(x) \log_2\left(\frac{1}{f_X(x)}\right) = -\sum_{x=-\infty}^{\infty} f_X(x) \log_2 f_X(x) \tag{28.102}$$

$$H(A) = -\sum_{\forall a \in A} h(a, n) \log_2 h(a, n) \tag{28.103}$$

Definition 28.47 (Differential Entropy). The differential (also called continuous) entropy
h(X) is a generalization of the information entropy to continuous probability density
functions $f_X$ of random variables X. [1266]

$$h(X) = -\int_{-\infty}^{\infty} f_X(x) \ln f_X(x) \, dx \tag{28.104}$$
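
As a small illustration of Equation 28.103, the following Python sketch computes the information entropy of a data sample. The function name `sample_entropy` is our own, and we assume that the sum runs over the distinct values a occurring in A.

```python
import math
from collections import Counter

def sample_entropy(sample):
    """Information entropy in bits of a data sample (Equation 28.103)."""
    n = len(sample)
    counts = Counter(sample)          # absolute frequency of each distinct value
    # h(a, n) = count / n is the relative frequency of the value a
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(sample_entropy([1, 2, 3, 4, 5, 6]))   # six equally frequent values
print(sample_entropy([0, 0, 0, 1]))         # a less uncertain sample
```

A sample in which all values are equally frequent yields the largest entropy for its number of distinct values, while a constant sample has an entropy of zero.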



28.3.8 The Law of Large Numbers 

The law of large numbers (LLN) combines statistics and probability by showing that if an 
event e with the probability P(e) = p is observed in n independent repetitions of a random 
experiment, its relative frequency h(e,n) (see Definition 28.10) converges to its probability 
p if n becomes larger. 

In the following, assume that A is an infinite sequence of samples from identically
distributed and pairwise independent random variables $X_i$ with the (same) expected value
EX. The weak law of large numbers states that, for each positive real number ε ∈ R+, the
probability that the mean $\bar{a}$ of the first n elements of A lies in (EX − ε, EX + ε)
approaches 1 as n grows.

$$\lim_{n \to \infty} P(|\bar{a} - EX| < \varepsilon) = 1 \tag{28.105}$$

In other words, the weak law of large numbers says that the sample average will converge
to the expected value of the random experiment if the experiment is repeated many times.

According to the strong law of large numbers, the mean $\bar{a}$ of the sequence A even
converges (with probability 1) to the expected value EX of the underlying distribution for
infinitely large n.

$$P\left(\lim_{n \to \infty} \bar{a} = EX\right) = 1 \tag{28.106}$$

The law of large numbers implies that the accumulated results of each random experi- 
ment will approximate the underlying distribution function if repeated infinitely (under the 
condition that there exists an invariable underlying distribution function). 
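
The convergence of the relative frequency can be observed in a small simulation. The Python sketch below is only an illustration: the function name `relative_frequency`, the choice of a fair six-sided dice, and the fixed seed are our assumptions.

```python
import random

def relative_frequency(event, n, rng):
    """Relative frequency h(e, n) of `event` in n throws of a fair six-sided dice."""
    hits = sum(1 for _ in range(n) if rng.randint(1, 6) == event)
    return hits / n

rng = random.Random(42)               # fixed seed, so that runs are repeatable
for n in (10, 1_000, 100_000):
    # the printed frequency tends towards p = 1/6 as n grows
    print(n, relative_frequency(6, n, rng))
```

For small n the relative frequency may still be far away from 1/6; the law of large numbers only makes a statement about the behavior for growing n.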



40 http://en.wikipedia.org/wiki/Information_entropy [accessed 2007-07-03]

28.4 Some Discrete Distributions

In this section we will introduce some common discrete distributions. Discrete probability
distributions assign probabilities to the elements of a finite (or, at most, countably infinite)
set of discrete events/outcomes of a random experiment.

Parts of the information provided in this and the following section have been obtained 
from Wikipedia [2219]. 

28.4.1 Discrete Uniform Distribution 

The uniform distribution exists in a discrete 41 as well as in a continuous form. In this
section we want to discuss the discrete form whereas the continuous form is elaborated on
in Section 28.5.1.

All possible outcomes ω ∈ Ω of a random experiment which obeys the uniform distribution
have exactly the same probability. In the discrete uniform distribution, Ω has at
most countably infinitely many elements (although it is normally finite). The best example for
this distribution is throwing an ideal dice. This experiment has six possible outcomes $\omega_i$
where each has the same probability $P(\omega_i) = \frac{1}{6}$. Throwing ideal coins and drawing one
element out of a set of n possible elements are other examples where a discrete uniform
distribution can be assumed. Table 28.2 contains the characteristics of the discrete uniform
distribution. In Figure 28.1 you can find some example uniform probability mass functions and in
Figure 28.2 we have outlined their according cumulative distribution functions.



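parameter     definition

parameters    $a, b \in \mathbb{Z},\ b \ge a$   (28.107)
$|\Omega|$    $|\Omega| = r = \mathrm{range} = b - a + 1$   (28.108)
PMF           $P(X = x) = f_X(x) = \begin{cases} \frac{1}{r} & \text{if } a \le x \le b,\ x \in \mathbb{Z} \\ 0 & \text{otherwise} \end{cases}$   (28.109)
CDF           $P(X \le x) = F_X(x) = \begin{cases} 0 & \text{if } x < a \\ \frac{\lfloor x \rfloor - a + 1}{r} & \text{if } a \le x \le b \\ 1 & \text{otherwise} \end{cases}$   (28.110)
mean          $EX = \frac{a+b}{2}$   (28.111)
median        $\mathrm{med} = \frac{a+b}{2}$   (28.112)
mode          $\mathrm{mode} = \emptyset$   (28.113)
variance      $D^2X = \frac{r^2-1}{12}$   (28.114)
skewness      $\gamma_1 = 0$   (28.115)
kurtosis      $\gamma_2 = -\frac{6(r^2+1)}{5(r^2-1)}$   (28.116)
entropy       $H(X) = \ln r$   (28.117)
mgf           $M_X(t) = \frac{e^{at} - e^{(b+1)t}}{r(1 - e^{t})}$   (28.118)
char. func.   $\varphi_X(t) = \frac{e^{iat} - e^{i(b+1)t}}{r(1 - e^{it})}$   (28.119)

Table 28.2: Parameters of the discrete uniform distribution.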



41 http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29 [accessed 2007-07-03]

Figure 28.1: The PMFs of some discrete uniform distributions

Figure 28.2: The CDFs of some discrete uniform distributions



28.4.2 Poisson Distribution $\pi_\lambda$

The Poisson distribution 42 $\pi_\lambda$ [20] complies with the reference model telephone switchboard.
It describes a process where the number of events that occur (independently of each other)
in a certain time interval only depends on the duration of the interval and not on its position
(prehistory). Events do not have any aftermath and thus, there is no mutual influence of
non-overlapping time intervals (homogeneity). Furthermore, only the time when an event occurs
is considered and not the duration of the event. In the telephone switchboard example, we
would only be interested in the time at which a call comes in, not in the length of the call.
In this model, no events occur in infinitely short time intervals. The features of the Poisson
distribution are listed in Table 28.3 and examples for its PMF and CDF are illustrated in
Figure 28.3 and Figure 28.4.

42 http://en.wikipedia.org/wiki/Poisson_distribution [accessed 2007-07-03]



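parameter     definition

parameters    $\lambda = \mu t > 0$   (28.120)
PMF           $P(X = x) = f_X(x) = \frac{(\mu t)^x}{x!} e^{-\mu t} = \frac{\lambda^x}{x!} e^{-\lambda}$   (28.121)
CDF           $P(X \le x) = F_X(x) = \frac{\Gamma(\lfloor x+1 \rfloor, \lambda)}{\lfloor x \rfloor !} = \sum_{i=0}^{\lfloor x \rfloor} \frac{e^{-\lambda} \lambda^i}{i!}$   (28.122)
mean          $EX = \mu t = \lambda$   (28.123)
median        $\mathrm{med} \approx \left\lfloor \lambda + \frac{1}{3} - \frac{0.02}{\lambda} \right\rfloor$   (28.124)
mode          $\mathrm{mode} = \lfloor \lambda \rfloor$   (28.125)
variance      $D^2X = \mu t = \lambda$   (28.126)
skewness      $\gamma_1 = \lambda^{-\frac{1}{2}}$   (28.127)
kurtosis      $\gamma_2 = \frac{1}{\lambda}$   (28.128)
entropy       $H(X) = \lambda(1 - \ln \lambda) + e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k \ln(k!)}{k!}$   (28.129)
mgf           $M_X(t) = e^{\lambda(e^t - 1)}$   (28.130)
char. func.   $\varphi_X(t) = e^{\lambda(e^{it} - 1)}$   (28.131)

Table 28.3: Parameters of the Poisson distribution.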
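
The PMF and the sum form of the CDF from Table 28.3 can be evaluated directly. The following Python sketch is a plain illustration; the function names `poisson_pmf` and `poisson_cdf` are our own, and the direct use of the factorial is only suitable for moderate values of x.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with parameter lam > 0 (Equation 28.121)."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

def poisson_cdf(x, lam):
    """P(X <= x), summing the PMF from 0 to floor(x) (Equation 28.122)."""
    if x < 0:
        return 0.0
    return sum(poisson_pmf(i, lam) for i in range(math.floor(x) + 1))

lam = 4.0
print(poisson_pmf(3, lam))      # probability of exactly three events
print(poisson_cdf(3.7, lam))    # probability of at most three events
```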




Poisson Process 

The Poisson process 44 [1914] is a process that obeys the Poisson distribution - just like the
example of the telephone switchboard mentioned before. Here, λ is expressed as the product
of the intensity μ and the time t. μ normally describes a frequency, i.e., a number of events
per unit of time. Both the expected value and the variance of the Poisson process are λ = μt.
In Equation 28.132, the probability that k events occur in a Poisson process in a time interval
of the length t is defined.

$$P(X_t = k) = \frac{(\mu t)^k}{k!} e^{-\mu t} = \frac{\lambda^k}{k!} e^{-\lambda} \tag{28.132}$$

Figure 28.4: The CDFs of some Poisson distributions

43 The Γ in Equation 28.122 denotes the (upper) incomplete gamma function. More information
on the gamma function Γ can be found in Section 28.10.1 on page 532.

44 http://en.wikipedia.org/wiki/Poisson_process [accessed 2007-07-03]

The probability that in a time interval [t, t + Δt]

1. no events occur is 1 − λΔt + o(Δt).

2. exactly one event occurs is λΔt + o(Δt).

3. multiple events occur is o(Δt).

Here we use an infinitesimal version of the small-o notation. 45 The statement that
f ∈ o(ξ) ⇒ |f(x)| ≪ |ξ(x)| is normally only valid for x → ∞. In the infinitesimal variant, it
holds for x → 0. Thus, we can state that o(Δt) is much smaller than Δt. In principle, the
above equations imply that in an infinitely small time span either no or one event occurs, i.e.,
events do not arrive simultaneously:

$$\lim_{t \to 0} P(X_t > 1) = 0 \tag{28.133}$$



The Relation between the Poisson Process and the Exponential Distribution 

It is important to know that the (time) distance between two events of the Poisson process is
exponentially distributed (see Section 28.5.3 on page 489). If the expected value of the number
of events to arrive per time unit in a Poisson process is $EX_{pois}$, then the expected value of
the time between two events is $\frac{1}{EX_{pois}}$. Since this is the expected value
$EX_{exp} = \frac{1}{\lambda_{exp}}$ of the exponential distribution, its $\lambda_{exp}$-value is
$\lambda_{exp} = \frac{1}{EX_{exp}} = \frac{1}{1 / EX_{pois}} = EX_{pois}$. Therefore, the
$\lambda_{exp}$-value of the exponential distribution equals the $\lambda_{pois}$-value of the Poisson
distribution: $\lambda_{exp} = \lambda_{pois} = EX_{pois}$. In other words, the time interval between
(neighboring) events of the Poisson process is exponentially distributed with the same λ value as
the Poisson process, as illustrated in Equation 28.134.

45 See Section 30.1.3 on page 550 and Definition 30.16 on page 551 for a detailed elaboration on
the small-o notation.

$$X_i \sim \pi_\lambda \Leftrightarrow \left(t(X_{i+1}) - t(X_i)\right) \sim \exp(\lambda) \quad \forall i \in \mathbb{N} \tag{28.134}$$
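
The relation stated in Equation 28.134 can be checked empirically: if the gaps between events are drawn from an exponential distribution, the number of events per time interval should have a mean and a variance close to the product of the rate and the interval length. The Python sketch below is an illustration only; the function name `count_events`, the rate, the interval length, and the seed are assumptions of ours.

```python
import random

def count_events(rate, horizon, rng):
    """Count arrivals in [0, horizon] when the gaps are exponentially distributed."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)    # time until the next event
        if t > horizon:
            return count
        count += 1

rng = random.Random(7)                # fixed seed, so that runs are repeatable
rate, horizon, runs = 3.0, 1.0, 20_000
counts = [count_events(rate, horizon, rng) for _ in range(runs)]
mean = sum(counts) / runs
var = sum((c - mean) ** 2 for c in counts) / (runs - 1)
print(mean, var)                      # both should lie close to rate * horizon
```

That the sample mean and the sample variance nearly coincide is the behavior expected from a Poisson distributed count, see Equations 28.123 and 28.126.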
28.4.3 Binomial Distribution B(n,p) 

The binomial distribution 46 B(n, p) is the probability distribution that describes the
probability of the possible numbers of successes in n independent experiments with the success
probability p each. Such experiments are called Bernoulli experiments or Bernoulli trials. For
n = 1, the binomial distribution is a Bernoulli distribution 47.

Table 28.4 points out some of the properties of the binomial distribution. A few ex- 
amples for PMFs and CDFs of different binomial distributions are given in Figure 28.5 and 
Figure 28.6. 



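parameter     definition

parameters    $n \in \mathbb{N}_0,\ 0 \le p \le 1,\ p \in \mathbb{R}$   (28.135)
PMF           $P(X = x) = f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$   (28.136)
CDF           $P(X \le x) = F_X(x) = \sum_{i=0}^{\lfloor x \rfloor} f_X(i) = I_{1-p}(n - \lfloor x \rfloor, 1 + \lfloor x \rfloor)$   (28.137)
mean          $EX = np$   (28.138)
median        med is one of $\{\lfloor np \rfloor - 1, \lfloor np \rfloor, \lfloor np \rfloor + 1\}$   (28.139)
mode          $\mathrm{mode} = \lfloor (n+1)p \rfloor$   (28.140)
variance      $D^2X = np(1-p)$   (28.141)
skewness      $\gamma_1 = \frac{1 - 2p}{\sqrt{np(1-p)}}$   (28.142)
kurtosis      $\gamma_2 = \frac{1 - 6p(1-p)}{np(1-p)}$   (28.143)
entropy       $H(X) = \frac{1}{2} \ln\left(2 \pi n e\, p(1-p)\right) + O\left(\frac{1}{n}\right)$   (28.144)
mgf           $M_X(t) = (1 - p + p e^{t})^n$   (28.145)
char. func.   $\varphi_X(t) = (1 - p + p e^{it})^n$   (28.146)

Table 28.4: Parameters of the Binomial distribution.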



For n → ∞, the binomial distribution approaches a normal distribution. For large n,
B(n, p) can therefore often be approximated with the normal distribution (see Section 28.5.2)
N(np, np(1 − p)). Whether this approximation is good or not can be checked with rules of
thumb, some of which are:

$$np > 5 \land n(1-p) > 5$$

$$\mu \pm 3\sigma \approx np \pm 3\sqrt{np(1-p)} \in [0, n]$$

In case these rules hold, we still need to transform a continuous distribution to a discrete
one. In order to do so, we add 0.5 to the x values, i.e., $F_{X,bin}(x) \approx F_{X,normal}(x + 0.5)$.
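
A short Python sketch may illustrate how close the approximation can be. It compares the exact CDF of B(n, p), obtained by summing the PMF of Equation 28.136, with the normal CDF shifted by 0.5. The function names and the example values n = 100, p = 0.5, and x = 55 are our assumptions; the normal CDF is expressed via the error function from the standard library.

```python
import math

def binomial_cdf(x, n, p):
    """Exact P(X <= x) for B(n, p), summing the PMF of Equation 28.136."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(math.floor(x) + 1))

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def binomial_cdf_approx(x, n, p):
    """Normal approximation of the binomial CDF with the shift of +0.5."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    return normal_cdf(x + 0.5, mu, sigma)

n, p, x = 100, 0.5, 55
print(n * p > 5 and n * (1 - p) > 5)     # first rule of thumb
print(binomial_cdf(x, n, p), binomial_cdf_approx(x, n, p))
```

For values of p close to 0 or 1, or for small n, the rules of thumb fail and the two results can differ noticeably.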



46 http://en.wikipedia.org/wiki/Binomial_distribution [accessed 2007-10-01]
47 http://en.wikipedia.org/wiki/Bernoulli_distribution [accessed 2007-10-01]
48 $I_{1-p}$ in Equation 28.137 denotes the regularized incomplete beta function.

Figure 28.5: The PMFs of some binomial distributions

Figure 28.6: The CDFs of some binomial distributions



28.5 Some Continuous Distributions 

In this section we will introduce some common continuous distributions. Unlike the discrete
distributions, continuous distributions have an uncountably infinite set of possible
outcomes of random experiments. Thus, the PDF does not assign probabilities to single outcomes.
Only the CDF makes statements about the probability of a subset of possible outcomes of
a random experiment.

28.5.1 Continuous Uniform Distribution

After discussing the discrete uniform distribution in Section 28.4.1, we now elaborate on its
continuous form.

In a continuous uniform distribution, all possible outcomes in a range [a, b], b > a, have
exactly the same probability density. The characteristics of this distribution can be found in
Table 28.5. Examples of its probability density function are illustrated in Figure 28.7 whereas
the according cumulative distribution functions are outlined in Figure 28.8.



parameter     definition

parameters    $a, b \in \mathbb{R},\ b > a$   (28.147)
PDF           $f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$   (28.148)
CDF           $P(X \le x) = F_X(x) = \begin{cases} 0 & \text{if } x < a \\ \frac{x-a}{b-a} & \text{if } x \in [a, b] \\ 1 & \text{otherwise} \end{cases}$   (28.149)
mean          $EX = \frac{1}{2}(a + b)$   (28.150)
median        $\mathrm{med} = \frac{1}{2}(a + b)$   (28.151)
mode          $\mathrm{mode} = \emptyset$   (28.152)
variance      $D^2X = \frac{1}{12}(b - a)^2$   (28.153)
skewness      $\gamma_1 = 0$   (28.154)
kurtosis      $\gamma_2 = -\frac{6}{5}$   (28.155)
entropy       $h(X) = \ln(b - a)$   (28.156)
mgf           $M_X(t) = \frac{e^{tb} - e^{ta}}{t(b-a)}$   (28.157)
char. func.   $\varphi_X(t) = \frac{e^{itb} - e^{ita}}{it(b-a)}$   (28.158)

Table 28.5: Parameters of the continuous uniform distribution.

Figure 28.7: The PDFs of some continuous uniform distributions

Figure 28.8: The CDFs of some continuous uniform distributions



28.5.2 Normal Distribution N(μ, σ²)

Many phenomena in nature, like the size of chicken eggs, noise, errors in measurement, and
so on, can be considered as outcomes of random experiments with properties that
can be approximated by the normal distribution 50 N(μ, σ²) [2312]. Its probability density
function, shown for some example values in Figure 28.9, is symmetric about the expected value
μ and becomes flatter with rising standard deviation σ. The cumulative distribution function
is outlined for the same example values in Figure 28.10. Other characteristics of the normal
distribution can be found in Table 28.6.



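parameter     definition

parameters    $\mu \in \mathbb{R},\ \sigma \in \mathbb{R}^+$   (28.159)
PDF           $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$   (28.160)
CDF           $P(X \le x) = F_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(z-\mu)^2}{2\sigma^2}} \, dz$   (28.161)
mean          $EX = \mu$   (28.162)
median        $\mathrm{med} = \mu$   (28.163)
mode          $\mathrm{mode} = \mu$   (28.164)
variance      $D^2X = \sigma^2$   (28.165)
skewness      $\gamma_1 = 0$   (28.166)
kurtosis      $\gamma_2 = 0$   (28.167)
entropy       $h(X) = \ln\left(\sigma\sqrt{2\pi e}\right)$   (28.168)
mgf           $M_X(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}}$   (28.169)
char. func.   $\varphi_X(t) = e^{i\mu t - \frac{\sigma^2 t^2}{2}}$   (28.170)

Table 28.6: Parameters of the normal distribution.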



50 http://en.wikipedia.org/wiki/Normal_distribution [accessed 2007-07-03]

Figure 28.9: The PDFs of some normal distributions

Figure 28.10: The CDFs of some normal distributions

Definition 28.48 (Standard Normal Distribution). For the sake of simplicity, the standard
normal distribution N(0, 1) with the CDF Φ(x) is defined with μ = 0 and σ = 1. Values of this
function are listed in tables. You can compute the CDF of any normal distribution using the
one of the standard normal distribution by applying Equation 28.172.

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{z^2}{2}} \, dz \tag{28.171}$$

$$P(X \le x) = \Phi\left(\frac{x - \mu}{\sigma}\right) \tag{28.172}$$

Some values of Φ(x) are listed in Table 28.7. For the sake of saving space by using two
dimensions, we compose the values of x as a sum of a row and column value. If you want to
look up Φ(2.13) for example, you'd go to the row which starts with 2.1 and the column of
0.03, so you'd find Φ(2.13) ≈ 0.9834.
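
Instead of looking up a table, Φ can also be evaluated numerically. The following Python sketch assumes the well-known identity Φ(x) = (1 + erf(x/√2))/2 between Φ and the error function; the function names `phi` and `normal_cdf` and the example values are our own.

```python
import math

def phi(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via Equation 28.172."""
    return phi((x - mu) / sigma)

print(round(phi(2.13), 4))            # the table lookup from the text
print(normal_cdf(12.0, 10.0, 2.0))    # the same value as phi(1.0)
```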
Table 28.7: Some values of the standardized normal distribution.



Definition 28.49 (probit). The inverse of the cumulative distribution function of the
standard normal distribution is called the probit function. It is also often denoted as the
z-quantile of the standard normal distribution.

$$z(y) \equiv \mathrm{probit}(y) \equiv \Phi^{-1}(y) \tag{28.173}$$

$$y = \Phi(x) \Rightarrow \Phi^{-1}(y) = z(y) = x \tag{28.174}$$

The values of the quantiles of the standard normal distribution can also be looked up in
Table 28.7. Therefore, the previously discussed process is simply reversed. If we wanted to
find the value z(0.922), we locate the closest match in the table. In Table 28.7, we will find
0.9222 which leads us to x = 1.4 + 0.02. Hence, z(0.922) ≈ 1.42.
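
The reverse lookup can be cross-checked numerically. The sketch below uses the `NormalDist` class of Python's standard `statistics` module (available since Python 3.8); the wrapper name `probit` is our own choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def probit(y):
    """Inverse of the standard normal CDF, for 0 < y < 1."""
    return std_normal.inv_cdf(y)

z = probit(0.922)
print(round(z, 2))                    # compare with the table-based estimate
print(std_normal.cdf(z))              # applying the CDF recovers y
```

The numerical result differs slightly from the table-based value since the table only offers the closest entry 0.9222 instead of 0.922.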

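The probability density function PDF of the multivariate normal distribution 51 [2005,
1899, 1772] is illustrated in Equation 28.175 and Equation 28.176 in the general case (where
Σ is the covariance matrix) and in Equation 28.177 in the uncorrelated form. If the
distributions, additionally to being uncorrelated, also have the same parameters σ and μ, the
probability density function of the multivariate normal distribution can be expressed as it
is done in Equation 28.178.

$$f_X(x) = \frac{\sqrt{|\Sigma^{-1}|}}{(2\pi)^{\frac{n}{2}}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \tag{28.175}$$

$$= \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \tag{28.176}$$

$$f_X(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} e^{-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}} \tag{28.177}$$

$$f_X(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2}} \tag{28.178}$$

51 http://en.wikipedia.org/wiki/Multivariate_normal_distribution [accessed 2007-07-03]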

Definition 28.50 (Central Limit Theorem). The central limit theorem 52 (CLT) states
that the sum $S_n = \sum_{i=1}^{n} X_i$ of n independent, identically distributed random variables
$X_i$ with finite expected values $E[X_i]$ and finite, non-zero variances $D^2[X_i] > 0$ approaches
a normal distribution for n → +∞. [675, 1084, 2041]



28.5.3 Exponential Distribution exp(λ)

The exponential distribution 53 exp(λ) [556] is often used if the probabilities of lifetimes of
apparatuses, half-life periods of radioactive elements, or the time between two events in
the Poisson process (see Section 28.4.2 on page 482) have to be approximated. Its PDF is
sketched in Figure 28.11 for some example values of λ; the according cases of the CDF are
illustrated in Figure 28.12. The most important characteristics of the exponential distribution
can be obtained from Table 28.8.



parameter     definition

parameters    $\lambda \in \mathbb{R}^+$   (28.179)
PDF           $f_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ \lambda e^{-\lambda x} & \text{otherwise} \end{cases}$   (28.180)
CDF           $P(X \le x) = F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-\lambda x} & \text{otherwise} \end{cases}$   (28.181)
mean          $EX = \frac{1}{\lambda}$   (28.182)
median        $\mathrm{med} = \frac{\ln 2}{\lambda}$   (28.183)
mode          $\mathrm{mode} = 0$   (28.184)
variance      $D^2X = \frac{1}{\lambda^2}$   (28.185)
skewness      $\gamma_1 = 2$   (28.186)
kurtosis      $\gamma_2 = 6$   (28.187)
entropy       $h(X) = 1 - \ln \lambda$   (28.188)
mgf           $M_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-1}$   (28.189)
char. func.   $\varphi_X(t) = \left(1 - \frac{it}{\lambda}\right)^{-1}$   (28.190)

Table 28.8: Parameters of the exponential distribution.

52 http://en.wikipedia.org/wiki/Central_limit_theorem [accessed 2008-08-19]
53 http://en.wikipedia.org/wiki/Exponential_distribution [accessed 2007-07-03]

Figure 28.11: The PDFs of some exponential distributions




Figure 28.12: The CDFs of some exponential distributions 



28.5.4 Chi-square Distribution

The chi-square (or χ²) distribution 54 is a continuous probability distribution on the set of
positive real numbers. It is a so-called sampling distribution which is used for the estimation
of parameters like the variance of other distributions. It also describes the sum of the
squares of n independent, standard normally distributed random variables. Its sole parameter,
n, denotes the degrees of freedom.

In Table 28.9 55, the characteristic parameters of the χ² distribution are outlined. A few
examples for the PDF and CDF of the χ² distribution are illustrated in Figure 28.13 and
Figure 28.14.

Table 28.10 provides some selected values of the χ² distribution. The table's headline
contains results of the cumulative distribution function $F_X(x)$ of a χ² distribution with n
degrees of freedom (values in the first column). The cells now denote the x values that belong
to these $(n, F_X(x))$ combinations.

54 http://en.wikipedia.org/wiki/Chi-square_distribution [accessed 2007-09-30]

55 γ(n, z) in Equation 28.193 is the lower incomplete Gamma function and $P_\gamma(n, z)$ is the
regularized Gamma function.
parameter     definition

parameters    $n \in \mathbb{R}^+,\ n > 0$   (28.191)
PDF           $f_X(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \frac{2^{-n/2}}{\Gamma(n/2)} x^{n/2-1} e^{-x/2} & \text{otherwise} \end{cases}$   (28.192)
CDF           $P(X \le x) = F_X(x) = \frac{\gamma(n/2, x/2)}{\Gamma(n/2)} = P_\gamma(n/2, x/2)$   (28.193)
mean          $EX = n$   (28.194)
median        $\mathrm{med} \approx n - \frac{2}{3}$   (28.195)
mode          $\mathrm{mode} = n - 2$ if $n \ge 2$   (28.196)
variance      $D^2X = 2n$   (28.197)
skewness      $\gamma_1 = \sqrt{8/n}$   (28.198)
kurtosis      $\gamma_2 = \frac{12}{n}$   (28.199)
entropy       $h(X) = \frac{n}{2} + \ln\left(2\Gamma(n/2)\right) + (1 - n/2)\,\psi(n/2)$   (28.200)
mgf           $M_X(t) = (1 - 2t)^{-n/2}$ for $2t < 1$   (28.201)
char. func.   $\varphi_X(t) = (1 - 2it)^{-n/2}$   (28.202)

Table 28.9: Parameters of the χ² distribution.

Figure 28.14: The CDFs of some χ² distributions

Table 28.10: Some values of the χ² distribution.

28.5.5 Student's t-Distribution

The Student's t-distribution 56 is based on the insight that the mean of a normally distributed
feature of a sample is no longer normally distributed if the variance is unknown and needs
to be estimated from the data samples [840, 841, 679]. It was developed by Gosset [840],
who published it under the pseudonym Student.

The parameter n of the distribution denotes the degrees of freedom of the distribution.
If n approaches infinity, the t-distribution approaches the standard normal distribution.

The characteristic properties of Student's t-distribution are outlined in Table 28.11 57
and examples for its PDF and CDF are illustrated in Figure 28.15 and Figure 28.16.

Table 28.12 provides some selected values for the quantiles of the t-distribution
(one-sided confidence intervals, see Section 28.7.3 on page 503). The headline of the table
contains results of the cumulative distribution function $F_X(x)$ of a Student's t-distribution
with n degrees of freedom (values in the first column). The cells now denote the x values
that belong to these $(n, F_X(x))$ combinations.



parameter     definition

parameters    $n \in \mathbb{R}^+,\ n > 0$   (28.203)
PDF           $f_X(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$   (28.204)
CDF           $P(X \le x) = F_X(x) = \frac{1}{2} + x\,\Gamma\left(\frac{n+1}{2}\right) \frac{{}_2F_1\left(\frac{1}{2}, \frac{n+1}{2}; \frac{3}{2}; -\frac{x^2}{n}\right)}{\sqrt{\pi n}\,\Gamma\left(\frac{n}{2}\right)}$   (28.205)
mean          $EX = 0$   (28.206)
median        $\mathrm{med} = 0$   (28.207)
mode          $\mathrm{mode} = 0$   (28.208)
variance      $D^2X = \frac{n}{n-2}$ for $n > 2$, otherwise undefined   (28.209)
skewness      $\gamma_1 = 0$ for $n > 3$   (28.210)
kurtosis      $\gamma_2 = \frac{6}{n-4}$ for $n > 4$   (28.211)
entropy       $h(X) = \frac{n+1}{2}\left[\psi\left(\frac{1+n}{2}\right) - \psi\left(\frac{n}{2}\right)\right] + \log\left[\sqrt{n}\, B\left(\frac{n}{2}, \frac{1}{2}\right)\right]$   (28.212)
mgf           undefined   (28.213)

Table 28.11: Parameters of the Student's t-distribution.

56 http://en.wikipedia.org/wiki/Student%27s_t-distribution [accessed 2007-09-30]
57 More information on the gamma function Γ used in Equation 28.204 and Equation 28.205 can
be found in Section 28.10.1 on page 532. ${}_2F_1$ in Equation 28.205 stands for the hypergeometric
function, ψ and B in Equation 28.212 are the digamma and the beta function.

Figure 28.16: The CDFs of some Student's t-distributions

Table 28.12: Table of Student's t-distribution with right-tail probabilities.

28.6 Example - Throwing a Dice



Let us now discuss the different parameters of a random variable using the example of throwing
a dice. On a dice, numbers from one to six are written and the result of throwing it is the
number written on the side facing upwards. If a dice is perfect, the numbers one to six will
show up with exactly the same probability, $\frac{1}{6}$. The set of all possible outcomes of
throwing a dice Ω is thus

$$\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4, \omega_5, \omega_6\} \tag{28.214}$$

We define a random variable $X : \Omega \mapsto \mathbb{R}$ that assigns real numbers to the possible
outcomes of throwing the dice in a way that the value of X matches the number on the dice:

$$X : \Omega \mapsto \{1, 2, 3, 4, 5, 6\} \tag{28.215}$$

It is obviously a uniformly distributed discrete random variable (see Section 28.4.1 on
page 479) that can take on six states. We can now define the probability mass function PMF
and the according cumulative distribution function CDF as follows (see also Figure 28.17):

$$F_X(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < 1 \\ \frac{\lfloor x \rfloor}{6} & \text{if } 1 \le x \le 6 \\ 1 & \text{otherwise} \end{cases} \tag{28.216}$$

$$f_X(x) = P(X = x) = \begin{cases} 0 & \text{if } x < 1 \\ \frac{1}{6} & \text{if } 1 \le x \le 6,\ x \in \mathbb{Z} \\ 0 & \text{otherwise} \end{cases} \tag{28.217}$$

Figure 28.17: The PMF and CDF of the dice throw



We can now discuss the statistical parameters of this experiment. This is a good
opportunity to compare the real parameters and their estimates. We therefore assume that the
dice was thrown ten times (n = 10) in an experiment. The following numbers have been
thrown (as illustrated in Figure 28.18):

A = {4,5,3,2,4,6,4,2,5,3} (28.218) 

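Table 28.13 outlines how the parameters of the random variable are computed. The real
values of the parameters are defined using the PMF or CDF functions, while the estimations
are based solely on the sample data obtained from our experiment.

Figure 28.18: The numbers thrown in the dice example

parameter   true value                                                       estimate

count       nonexistent                                                      $n = \mathrm{len}(A) = 10$   (28.219)
minimum     $a = \min\{x : f_X(x) > 0\} = 1$                                 $\tilde{a} = \min A = 2 \approx a$   (28.220)
maximum     $b = \max\{x : f_X(x) > 0\} = 6$                                 $\tilde{b} = \max A = 6 \approx b$   (28.221)
range       $\mathrm{range} = r = b - a + 1 = 6$                             $\tilde{r} = \tilde{b} - \tilde{a} + 1 = 5 \approx \mathrm{range}$   (28.222)
mean        $EX = \frac{a+b}{2} = \frac{7}{2} = 3.5$                         $\bar{a} = \frac{1}{n}\sum_{i=0}^{n-1} A[i] = \frac{19}{5} = 3.8 \approx EX$   (28.223)
median      $\mathrm{med} = \frac{a+b}{2} = \frac{7}{2} = 3.5$               $A_s = \mathrm{sortList}_a(A, >)$, $\widetilde{\mathrm{med}} = \frac{A_s[n/2] + A_s[n/2 - 1]}{2} = 4 \approx \mathrm{med}$   (28.224)
mode        $\mathrm{mode} = \emptyset$                                      $\widetilde{\mathrm{mode}} = \{4\} \approx \mathrm{mode}$   (28.225)
variance    $D^2X = \frac{r^2-1}{12} = \frac{35}{12} \approx 2.917$          $s^2 = \frac{1}{n-1}\sum_{i=0}^{n-1} (A[i] - \bar{a})^2 = \frac{26}{15} \approx 1.73 \approx D^2X$   (28.226)
skewness    $\gamma_1 = 0$                                                   $G_1 \approx 0.0876 \approx \gamma_1$   (28.227)
kurtosis    $\gamma_2 = -\frac{6(r^2+1)}{5(r^2-1)} = -\frac{222}{175} \approx -1.269$   $G_2 \approx -0.7512 \approx \gamma_2$   (28.228)

Table 28.13: Parameters of the dice throw experiment.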

As you can see, the estimations of the parameters sometimes differ significantly from 
their true values. More information about estimation can be found in the following section. 
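
Several of the estimates for the dice example can be reproduced with Python's standard `statistics` module. The sketch below is an illustration only: the dictionary layout and variable names are our own, and `statistics.variance` is used because it divides by n − 1.

```python
import statistics

sample = [4, 5, 3, 2, 4, 6, 4, 2, 5, 3]   # the ten throws of Equation 28.218

n = len(sample)
a_est, b_est = min(sample), max(sample)
estimates = {
    "count": n,
    "minimum": a_est,
    "maximum": b_est,
    "range": b_est - a_est + 1,
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "mode": statistics.mode(sample),
    "variance": statistics.variance(sample),   # divides by n - 1
}

# True values of the fair dice, from the discrete uniform distribution on 1..6
a, b = 1, 6
r = b - a + 1
true_values = {"mean": (a + b) / 2, "variance": (r * r - 1) / 12}

for name, value in estimates.items():
    print(name, value, true_values.get(name, ""))
```

With only ten throws, the estimated variance stays well below the true value of 35/12, which illustrates how much estimates from small samples can deviate.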
28.7.1 Introduction

Estimation theory is the science of approximating the values of parameters based on
measurements or otherwise obtained sample data [1727, 1105, 1668, 2100, 1882]. The center of
this branch of statistics is to find good estimators in order to approximate the real values
of the parameters as well as possible.

Definition 28.51 (Estimator). An estimator 58 $\tilde{\theta}$ is a rule (most often a mathematical
function) that takes a set of sample data A as input and returns an estimation of one
parameter θ of the random distribution of the process sampled with this data set.

We have already discussed some estimators in Section 28.3 - the arithmetic mean of
a sample data set (see Definition 28.29 on page 473) for example is an estimator for the
expected value (see Definition 28.27 on page 473) and in Equation 28.67 on page 474 we
have introduced an estimator for the sample variance.

Obviously, an estimator $\tilde{\theta}$ is the better, the closer its results (the estimates) come to
the real values of the parameter θ.

Definition 28.52 (Point Estimator). We define a point estimator $\tilde{\theta}$ to be an estimator
which is a mathematical function $\tilde{\theta} : \mathbb{R}^n \mapsto \mathbb{R}$. This function takes the data sample A
(here considered as a real vector $A \in \mathbb{R}^n$) as input and returns the estimate in the form of
a (real) scalar value.

Definition 28.53 (Error). The absolute (estimation) error ε 59 is the difference between
the value returned by a point estimator $\tilde{\theta}$ of a parameter θ for a certain input A and its
real value. Notice that the error ε can be zero, positive, or negative.

$$\varepsilon_A(\tilde{\theta}) = \tilde{\theta}(A) - \theta \tag{28.229}$$

In the following, we will most often not explicitly refer to the data sample A as basis of
the estimation $\tilde{\theta}$ anymore. We assume that it is implicitly clear that estimations are usually
based on such samples and that subscripts like the A in $\varepsilon_A$ in Equation 28.229 are not
needed.

Definition 28.54 (Bias). The bias $\mathrm{Bias}(\tilde{\theta})$ of an estimator $\tilde{\theta}$ is the expected value of
the difference of the estimate and the real value. This mean error is zero for all unbiased
estimators.

$$\mathrm{Bias}(\tilde{\theta}) = E\left[\tilde{\theta} - \theta\right] = E\left[\varepsilon(\tilde{\theta})\right] \tag{28.230}$$

Definition 28.55 (Unbiased Estimator). An unbiased estimator has a zero bias.

$$\mathrm{Bias}(\tilde{\theta}) = E\left[\tilde{\theta} - \theta\right] = E\left[\varepsilon(\tilde{\theta})\right] = 0 \Leftrightarrow E\tilde{\theta} = \theta \tag{28.231}$$

Definition 28.56 (Mean Square Error). The mean square error 60 $\mathrm{MSE}(\tilde{\theta})$ of an
estimator $\tilde{\theta}$ is the expected value of the square of the estimation error ε. It is also the sum
of the variance of the estimator and the square of its bias. The MSE is a measure for how
much an estimator differs from the quantity to be estimated.



58 http://en.wikipedia.org/wiki/Estimator [accessed 2007-07-03], http://mathworld.wolfram.com/ 
Estimator.html [accessed 2007-07-03] 

59 http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics [accessed 2007-07-03] 

60 http://en.wikipedia.org/wiki/Mean_squared_error [accessed 2007-07-03] 
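$$\mathrm{MSE}(\tilde{\theta}) = E\left[(\tilde{\theta} - \theta)^2\right] = E\left[\varepsilon(\tilde{\theta})^2\right] \tag{28.232}$$

$$\mathrm{MSE}(\tilde{\theta}) = D^2\tilde{\theta} + \left(\mathrm{Bias}(\tilde{\theta})\right)^2 \tag{28.233}$$

Notice that the MSE of unbiased estimators coincides with the variance $D^2\tilde{\theta}$ of $\tilde{\theta}$. For
estimating the mean square error of an estimator $\tilde{\theta}$, we use the sample mean:

$$\widetilde{\mathrm{MSE}}(\tilde{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \left(\tilde{\theta}_i - \theta\right)^2 \tag{28.234}$$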
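
The definitions of bias and mean square error can be made tangible with a small simulation. The Python sketch below compares two variance estimators on repeated samples of dice throws and estimates their bias and MSE by sample means, in the spirit of Equation 28.234. All names, the sample size n = 10, the number of repetitions, and the seed are assumptions chosen for illustration.

```python
import random

def var_biased(xs):
    """Variance estimator dividing by n (biased)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_unbiased(xs):
    """Variance estimator dividing by n - 1 (unbiased)."""
    return var_biased(xs) * len(xs) / (len(xs) - 1)

def estimate_bias_and_mse(estimator, theta, n, runs, rng):
    """Sample means of the error and of the squared error over many samples."""
    errors = []
    for _ in range(runs):
        sample = [rng.randint(1, 6) for _ in range(n)]   # n throws of a fair dice
        errors.append(estimator(sample) - theta)          # error of this estimate
    bias = sum(errors) / runs
    mse = sum(e * e for e in errors) / runs
    return bias, mse

theta = 35 / 12                       # true variance of a fair dice
rng = random.Random(3)                # fixed seed, so that runs are repeatable
for est in (var_biased, var_unbiased):
    print(est.__name__, estimate_bias_and_mse(est, theta, 10, 20_000, rng))
```

The estimator dividing by n should show a clearly negative bias of roughly −θ/n, whereas the bias of the estimator dividing by n − 1 should stay close to zero.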



28.7.2 Likelihood and Maximum Likelihood Estimators 

Definition 28.57 (Likelihood). Likelihood 61 is a mathematical expression complementary
to probability. Whereas probability allows us to predict the outcome of a random experiment
based on known parameters, likelihood allows us to predict unknown parameters based on
the outcome of experiments.

Definition 28.58 (Likelihood Function). The likelihood function L returns a value that
is proportional to the probability of a postulated underlying law or probability distribution
φ according to an observed outcome (denoted as the vector y). Notice that L does not
necessarily represent a probability density/mass function and its integral also does not
necessarily equal 1.

$$L[\varphi \mid y] \propto P(y \mid \varphi) \tag{28.235}$$

In many sources, L is defined in dependency of a parameter θ instead of the function φ.
We preferred the latter notation since it is a more general superset of the first one.



Observation of an Unknown Process φ

Assume that we are given a finite set A of n sample data points.

$$A = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \quad x_i, y_i \in \mathbb{R} \ \forall i \in [1, n] \tag{28.236}$$

The $x_i$ are known inputs or parameters of an unknown process defined by the function
$\varphi : \mathbb{R} \mapsto \mathbb{R}$. By observing the corresponding outputs of the process, we have obtained the
$y_i$ values. During our observations, we make the measurement errors 62 $\eta_i$.

$$y_i = \varphi(x_i) + \eta_i \quad \forall i : 0 < i \le n \tag{28.237}$$

About this measurement error η we make the following assumptions:

$$E\eta = 0 \tag{28.238}$$

$$\eta \sim N(0, \sigma^2) : 0 < \sigma < \infty \tag{28.239}$$

$$\mathrm{cov}(\eta_i, \eta_j) = 0 \quad \forall i, j \in \mathbb{N} : i \ne j,\ 0 < i \le n,\ 0 < j \le n \tag{28.240}$$

1. The expected values of η in Equation 28.238 are all zero. Our measurement device thus
gives us, on average, unbiased results. If the expected value of η was not zero, we could
simply recalibrate our (imaginary) measurement equipment in order to subtract Eη from
all measurements and would obtain unbiased observations.

2. Furthermore, Equation 28.239 states that the $\eta_i$ are normally distributed around the
zero point with an unknown, nonzero variance σ². To suppose measurement errors to
be normally distributed is quite common and correct in most cases. The white noise
in transmission of signals for example is often modeled with Gaussian distributed
amplitudes. This second assumption includes, of course, the first one: Being normally
distributed with N(μ = 0, σ²) implies a zero expected value of the error.

3. With Equation 28.240, we assume that the errors $\eta_i$ of the single measurements are
stochastically independent. If there existed a connection between them, it would be part
of the underlying physical law φ and could be incorporated in our measurement device
and again be subtracted.

61 http://en.wikipedia.org/wiki/Likelihood [accessed 2007-07-03]

62 http://en.wikipedia.org/wiki/Measurement_error [accessed 2007-07-03]



Objective: Estimation 

Assume that we can choose from a, possibly infinitely large, set of functions (estimators)
$f \in F$.

$$f \in F \Rightarrow f : \mathbb{R} \mapsto \mathbb{R} \tag{28.241}$$

From this set we want to pick the function $f^\star \in F$ that resembles $\varphi$ the best (i. e.,
better than all other $f \in F : f \ne f^\star$). $\varphi$ is not necessarily an element of F, so we cannot
always presume to find an $f^\star = \varphi$.

Each estimator f deviates by the estimation error $\varepsilon(f)$ (see Definition 28.53 on page 499)
from the $y_i$-values. The estimation error depends on f and may vary for different estimators.

$$y_i = f(x_i) + \varepsilon_i(f) \quad \forall i : 0 < i \le n \tag{28.242}$$

We consider all $f \in F$ to be valid estimators for $\varphi$ and simply look for the one that "fits
best". We now can combine Equation 28.242 with Equation 28.237:

$$f(x_i) + \varepsilon_i(f) = y_i = \varphi(x_i) + \eta_i \quad \forall i : 0 < i \le n \tag{28.243}$$

We do not know $\varphi$ and thus cannot determine the $\eta_i$. According to the likelihood method,
we pick the function $f \in F$ that would have most probably produced the outcomes $y_i$. In
other words, we have to maximize the likelihood of the occurrence of the $\varepsilon_i(f)$. The likelihood
here is defined under the assumption that the true measurement errors $\eta_i$ are normally
distributed (see Equation 28.239). So what we can do is to determine the $\varepsilon_i$ in a way that
their occurrence is most probable according to the distribution of the random variable that
created the $\eta_i$, $N(0, \sigma^2)$. In the best case, $\varepsilon_i(f^\star) = \eta_i$ and thus $f^\star(x_i)$ is equivalent to $\varphi(x_i)$,
at least for the sample information A available to us.



Maximizing the Likelihood 



Therefore, we can regard the $\varepsilon_i(f)$ as outcomes of independent random experiments, as
uncorrelated random variables, and combine them to a multivariate normal distribution.
For ease of notation, we define $\varepsilon(f)$ to be the vector containing all the single $\varepsilon_i(f)$-values.

$$\varepsilon(f) = \begin{pmatrix} \varepsilon_1(f) \\ \varepsilon_2(f) \\ \vdots \\ \varepsilon_n(f) \end{pmatrix} \tag{28.244}$$

63 http://en.wikipedia.org/wiki/White_noise [accessed 2007-07-03]
64 http://en.wikipedia.org/wiki/Gaussian_noise [accessed 2007-07-03]

The probability density function of a multivariate normal distribution with independent
variables that have the same variance $\sigma^2$ looks like this (as defined in Equation 28.178 on
page 489):

$$f_X(\varepsilon(f)) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (\varepsilon_i(f))^2}{2\sigma^2}} \tag{28.245}$$

Amongst all possible vectors $\varepsilon(f) : f \in F$ we need to find the most probable one
$\varepsilon^\star = \varepsilon(f^\star)$ according to Equation 28.245. The function $f^\star$ that produces it will then be
the one that most probably matches $\varphi$.

In order to express how likely the observation of some outcomes is under a certain set of
parameters, we have defined the likelihood function L in Definition 28.58. Here we can use
the probability density function $f_X$ of the normal distribution, since the maximal values of
$f_X$ are those that are most probable to occur.

$$L[\varepsilon(f) \mid f] = f_X(\varepsilon(f)) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (\varepsilon_i(f))^2}{2\sigma^2}} \tag{28.246}$$

$$f^\star \in F : L[\varepsilon(f^\star) \mid f^\star] = \max_{\forall f \in F} L[\varepsilon(f) \mid f] \tag{28.247}$$

$$= \max_{\forall f \in F} \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{\sum_{i=1}^{n} (\varepsilon_i(f))^2}{2\sigma^2}} \tag{28.248}$$

Finding an $f^\star$ that maximizes the function $f_X$, however, is equivalent to finding an $f^\star$ that
minimizes the sum of the squares of the $\varepsilon$-values, since the exponent is the only term that
depends on f and it decreases as this sum grows.

$$f^\star \in F : \sum_{i=1}^{n} (\varepsilon_i(f^\star))^2 = \min_{\forall f \in F} \sum_{i=1}^{n} (\varepsilon_i(f))^2 \tag{28.249}$$

According to Equation 28.242 we can now substitute the $\varepsilon_i$-values with the difference
between the observed outcomes $y_i$ and the estimates $f(x_i)$.

$$\sum_{i=1}^{n} (\varepsilon_i(f))^2 = \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{28.250}$$

Definition 28.59 (Maximum Likelihood Estimator). A maximum likelihood estimator 65 [37]
$f^\star$ is an estimator which fits with maximum likelihood to a given set of sample
data A. Under the particular assumption of uncorrelated error terms normally distributed
around zero, an MLE minimizes Equation 28.251.

$$f^\star \in F : \sum_{i=1}^{n} (y_i - f^\star(x_i))^2 = \min_{\forall f \in F} \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{28.251}$$

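Minimizing the sum of the squared differences between the observed $y_i$ and the estimates $f(x_i)$
also minimizes their mean, so with this we have also shown that the estimator that minimizes
the mean square error MSE (see Definition 28.56) is the best estimator according to the
likelihood of the produced outcomes.

$$\min_{\forall f \in F} \sum_{i=1}^{n} (y_i - f(x_i))^2 \;\Leftrightarrow\; \min_{\forall f \in F} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{28.252}$$

$$f^\star \in F : \mathrm{MSE}(f^\star) = \min_{\forall f \in F} \mathrm{MSE}(f) \tag{28.253}$$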
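
The equivalence between maximizing the likelihood and minimizing the sum of squares can be checked numerically. The following is a minimal sketch, not part of the original text: the sample data, the three candidate estimators, and the value of `sigma2` are hypothetical and chosen only for illustration. It evaluates the logarithm of Equation 28.246 and the sum of squared errors for each candidate and shows that both criteria select the same function.

```python
import math

def log_likelihood(residuals, sigma2):
    """Logarithm of Equation 28.246 for the residuals eps_i(f) and error variance sigma2."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)

def sum_of_squares(residuals):
    """Sum of the squared residuals, the quantity minimized in Equation 28.249."""
    return sum(e * e for e in residuals)

# Hypothetical sample data and candidate estimators (illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]
candidates = {
    "f(x) = 2x": lambda x: 2.0 * x,
    "f(x) = x + 2": lambda x: x + 2.0,
    "f(x) = 4": lambda x: 4.0,
}

def residuals(f):
    """Estimation errors eps_i(f) = y_i - f(x_i), see Equation 28.242."""
    return [y - f(x) for x, y in zip(xs, ys)]

sigma2 = 1.0  # assumed error variance; any positive value yields the same ranking
best_by_likelihood = max(
    candidates, key=lambda name: log_likelihood(residuals(candidates[name]), sigma2))
best_by_squares = min(
    candidates, key=lambda name: sum_of_squares(residuals(candidates[name])))

print(best_by_likelihood, "|", best_by_squares)
```

Since the logarithm is strictly increasing, ranking by log-likelihood is the same as ranking by likelihood, which is why the sketch works with the logarithm to avoid numerical underflow.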



65 http://en.wikipedia.org/wiki/Maximum_likelihood [accessed 2007-07-03] 

The term $(y_i - f(x_i))^2$ is often justified by the statement that large deviations of f from
the y-values are punished harder than smaller ones. The actual reason why we minimize
the square error, however, is that doing so maximizes the likelihood of the resulting estimator.

At this point, one should also notice that the $x_i$ could also be replaced with vectors
$x_i \in \mathbb{R}^m$ without any further implications or modifications of the equations.

In most practical cases, the set F of possible functions is narrowly defined. It usually
contains only one type of parameterized function, so we only have to determine the unknown
parameters in order to find $f^\star$. Let us consider a set of linear functions as an example. If we
want to find estimators of the form $F = \{\forall f(x) = ax + b : a, b \in \mathbb{R}\}$, we will minimize
Equation 28.254 by determining the best possible values for a and b.

$$\mathrm{MSE}(f(x) \mid a, b) = \frac{1}{n} \sum_{i=1}^{n} (a x_i + b - y_i)^2 \tag{28.254}$$

If we could find a perfect estimator $f^\star$ and our data were free of any measurement
error, all terms of the sum would become zero. For n > 2, this perfect estimator would be the
solution of the over-determined system of linear equations illustrated in Equation 28.255.

$$\begin{aligned} 0 &= a x_1 + b - y_1 \\ 0 &= a x_2 + b - y_2 \\ &\;\;\vdots \\ 0 &= a x_n + b - y_n \end{aligned} \tag{28.255}$$

Since it is normally not possible to obtain a perfect estimator because there are measurement
errors or other uncertainties like unknown dependencies, the system in Equation 28.255 often
cannot be solved exactly. Instead, we can only minimize the remaining deviations.
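
For the linear case, the minimum of Equation 28.254 can be computed in closed form by setting the partial derivatives with respect to a and b to zero. The sketch below is an illustration under that assumption, not code from the book: the function names `fit_line` and `mse` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for f(x) = a*x + b, minimizing Equation 28.254."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b

def mse(xs, ys, a, b):
    """Mean square error of f(x) = a*x + b on the sample, see Equation 28.254."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noisy measurements of a roughly linear process (illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]

a, b = fit_line(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, MSE = {mse(xs, ys, a, b):.4f}")
```

The system in Equation 28.255 has no exact solution for this data, yet the fitted pair (a, b) yields a smaller mean square error than any perturbed pair, which is what "minimizing" the system means here.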



Best Linear Unbiased Estimators 

The Gauss-Markov Theorem 66 defines BLUEs (best linear unbiased estimators) according
to the facts just discussed:

Definition 28.60 (BLUE). In a linear model in which the measurement errors $\varepsilon_i$ are
uncorrelated and are all normally distributed with an expected value of zero and the same
variance, the best linear unbiased estimators (BLUE) of the (unknown) coefficients are the
least-square estimators [1649].

Hence, the same three assumptions (Equation 28.238, Equation 28.239, and Equation 28.240
on page 500) that hold for the maximum likelihood estimator also hold for the best linear
unbiased estimator.



28.7.3 Confidence Intervals 

There is a very simple principle in statistics that always holds: All estimates may as well
be wrong. There is no guarantee whatsoever that we have estimated a parameter of an
underlying distribution correctly, regardless of how many samples we have analyzed. However, if
we can assume or know the underlying distribution of the process which has been sampled,
we can compute certain intervals which include the real value of the estimated parameter
with a certain probability.

Definition 28.61 (Confidence Interval). Unlike point estimators, which approximate a 
parameter of a data sample with a single value, confidence intervals 67 (CIs) are estimations 
that give certain upper and lower boundaries in which the true value of the parameter will 
be located with a certain, predefined probability. [684, 351, 478] 

66 http://en.wikipedia.org/wiki/Gauss-Markov_theorem [accessed 2007-07-03],
http://www.answers.com/topic/gauss-markov-theorem [accessed 2007-07-03]

67 http://en.wikipedia.org/wiki/Confidence_interval [accessed 2007-10-01]



504 28 Stochastic Theory and Statistics 



The advantage of confidence intervals is that we can directly derive the significance of 
the data samples from them - the larger the intervals are, the less reliable is the sample. 
The narrower confidence intervals get for high predefined probabilities, the more profound, 
i. e., significant, will the conclusions drawn from them be. 



Example 



Imagine we run a farm and own 25 chickens. Each chicken lays one egg a day. We collect 
all the eggs in the morning and weigh them in order to find the average weight of the eggs 
produced by our farm. Assume our sample contains the values (in g): 



$$A = \{120, 121, 119, 116, 115, 122, 121, 123, 122, 120, 119, 122, 121, 120, 119, 121, 123, 117, 118, 121\} \tag{28.256}$$
$$n = \mathrm{len}(A) = 20 \tag{28.257}$$



From these measurements, we can determine the arithmetic mean a and the sample
variance $s^2$ according to Equation 28.60 on page 473 and Equation 28.67 on page 474:

$$a = \frac{1}{n} \sum_{i=0}^{n-1} A[i] = \frac{2400}{20} = 120 \tag{28.258}$$
$$s^2 = \frac{1}{n-1} \sum_{i=0}^{n-1} (A[i] - a)^2 = \frac{92}{19} \tag{28.259}$$



The question that arises now is whether the mean of 120 is significant, i. e., whether it likely
approximates the expected value of the egg weight, or if the data sample was too small to
be representative. Furthermore, we would like to know in which interval the expected value
of the egg weights will likely be located. Now confidence intervals come into play. First,
we need to find out what the underlying distribution of the random variable producing A
as sample output is. In case of chicken eggs, we can safely assume 68 that it is the normal
distribution discussed in Section 28.5.2 on page 486. With that we can calculate an interval
which includes the unknown parameter $\mu$ (i. e., the real expected value) with a confidence
probability of $\gamma$. $\gamma = 1 - \alpha$ is the so-called confidence coefficient and $\alpha$ is the probability that
the real value of the estimated parameter does not lie inside the confidence interval.

Let us compute the interval including the expected value $\mu$ of the chicken egg weights
with a probability of $\gamma = 1 - \alpha = 95\%$. Thus, $\alpha = 0.05$. Therefore, we have to pick the right
formula from Section 28.7.3 on the facing page (here it is Equation 28.272 on the next page)
and substitute in the proper values, where $\frac{s}{\sqrt{n}} = \sqrt{\frac{92}{19 \cdot 20}} \approx 0.4920$:

$$\mu_\gamma \in \left[a \pm t_{1-\frac{\alpha}{2},\,n-1} \cdot \frac{s}{\sqrt{n}}\right] \tag{28.260}$$
$$\mu_{95\%} \in \left[120 \pm t_{0.975,\,19} \cdot \sqrt{\frac{92}{19 \cdot 20}}\right] \tag{28.261}$$
$$\mu_{95\%} \in [120 \pm 2.093 \cdot 0.4920] \tag{28.262}$$
$$\mu_{95\%} \in [118.97, 121.03] \tag{28.263}$$

The value of $t_{0.975,\,19}$ can easily be obtained from Table 28.12 on page 496 which contains
the respective quantiles of Student's t-distribution discussed in Section 28.5.5 on page 494.
Let us repeat the procedure in order to find the intervals that will contain $\mu$ with probabilities
$\gamma = 99\% \Rightarrow \alpha = 0.01$ and $\gamma = 90\% \Rightarrow \alpha = 0.1$:

$$\mu_{99\%} \in [120 \pm t_{0.995,\,19} \cdot 0.4920] \tag{28.264}$$
$$\mu_{99\%} \in [120 \pm 2.861 \cdot 0.4920] \tag{28.265}$$
$$\mu_{99\%} \in [118.59, 121.41] \tag{28.266}$$
$$\mu_{90\%} \in [120 \pm t_{0.95,\,19} \cdot 0.4920] \tag{28.267}$$
$$\mu_{90\%} \in [120 \pm 1.729 \cdot 0.4920] \tag{28.268}$$
$$\mu_{90\%} \in [119.15, 120.85] \tag{28.269}$$

As you can see, the higher the confidence probability we specify, the larger the
interval in which the parameter is contained becomes. We can be 99% sure that the expected
weight of the laid eggs is somewhere between 118.59 and 121.41. If we narrow the interval down
to [119.15, 120.85], we can only be 90% confident that the real expected value falls in it
based on the data samples which we have gathered.

68 Notice that such an assumption is also a possible source of error!
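
The computation of such intervals is easy to automate. The following sketch assumes the standard error $s / \sqrt{n}$ with the sample variance from Equation 28.259 and takes the three t-quantiles for 19 degrees of freedom as quoted in the text. The function name `t_confidence_interval` and the dictionary `T_QUANTILES_19` are hypothetical helpers introduced for this illustration.

```python
import math
import statistics

# Egg weights in g from Equation 28.256.
eggs = [120, 121, 119, 116, 115, 122, 121, 123, 122, 120,
        119, 122, 121, 120, 119, 121, 123, 117, 118, 121]

# Quantiles t_{1 - alpha/2, 19} of Student's t-distribution as quoted in the text,
# keyed by the confidence probability gamma.
T_QUANTILES_19 = {0.90: 1.729, 0.95: 2.093, 0.99: 2.861}

def t_confidence_interval(sample, t_quantile):
    """Two-sided confidence interval for the expected value with estimated variance."""
    n = len(sample)
    a = statistics.mean(sample)
    # statistics.variance divides by n - 1, matching the sample variance s^2.
    s = math.sqrt(statistics.variance(sample))
    half_width = t_quantile * s / math.sqrt(n)
    return a - half_width, a + half_width

intervals = {gamma: t_confidence_interval(eggs, t) for gamma, t in T_QUANTILES_19.items()}
for gamma, (low, high) in intervals.items():
    print(f"gamma = {gamma:.0%}: [{low:.2f}, {high:.2f}]")
```

Hard-coding the quantiles keeps the sketch free of third-party dependencies. For other sample sizes, the quantile would have to be looked up in a table or computed with a statistics library.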



Some Hand-Picked Confidence Intervals 

The following confidence intervals are two-sided, i. e., we determine a range
$\theta_\gamma \in [\theta' - x, \theta' + x]$ that contains the parameter $\theta$ with probability $\gamma$ based on the estimate
$\theta'$. If you need a one-sided confidence interval like $\theta_\gamma \in (-\infty, \theta' + x]$ or $\theta_\gamma \in [\theta' - x, \infty)$,
you just need to replace $1 - \frac{\alpha}{2}$ with $1 - \alpha$ in the equations.

Expected Value of a Normal Distribution $N(\mu, \sigma^2)$

With knowing the variance $\sigma^2$: If the exact variance $\sigma^2$ of the distribution underlying our
data samples is known, and we have an estimate of the expected value $\mu$ by the arithmetic
mean a according to Equation 28.60 on page 473, the two-sided confidence interval (of
probability $\gamma$) for the expected value of the normal distribution is:

$$\mu_\gamma \in \left[a \pm z\left(1 - \frac{\alpha}{2}\right) \frac{\sigma}{\sqrt{n}}\right] \tag{28.271}$$

Where $z(y) = \mathrm{probit}(y) = \Phi^{-1}(y)$ is the y-quantile of the standard normal distribution
(see Definition 28.49 on page 488) which can for example be looked up in Table 28.7.
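
Instead of looking up the quantile in a table, it can also be computed. The sketch below uses `NormalDist.inv_cdf` from the Python standard library as the probit function. The function name `z_confidence_interval`, the sample values, and the assumed known standard deviation are hypothetical and serve only as an illustration of Equation 28.271.

```python
import math
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, gamma):
    """Two-sided confidence interval for the expected value with known standard deviation."""
    alpha = 1.0 - gamma
    # z(1 - alpha/2): quantile of the standard normal distribution (probit function).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    a = mean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return a - half_width, a + half_width

# Hypothetical measurements with an assumed known standard deviation of 0.5.
sample = [9.8, 10.1, 10.3, 9.9, 10.4]
low, high = z_confidence_interval(sample, sigma=0.5, gamma=0.95)
print(f"[{low:.3f}, {high:.3f}]")
```

For a one-sided interval, the argument of `inv_cdf` would be `1.0 - alpha` instead of `1.0 - alpha / 2.0`, as described above.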

With estimated sample variance $s^2$: Often, the true variance $\sigma^2$ of an underlying distribution
is not known and instead estimated with the sample variance $s^2$ according to Equation 28.67
on page 474. The two-sided confidence interval (of probability $\gamma$) for the expected value can
then be computed using the arithmetic mean a and the estimate of the standard deviation
$s = \sqrt{s^2}$ of the sample and the $t_{1-\frac{\alpha}{2},\,n-1}$ quantile of Student's t-distribution which can be
looked up in Table 28.12 on page 496.

$$\mu_\gamma \in \left[a \pm t_{1-\frac{\alpha}{2},\,n-1} \frac{s}{\sqrt{n}}\right] \tag{28.272}$$



Variance of a Normal Distribution 

The two-sided confidence interval (of probability $\gamma$) for the variance of a normal distribution
can be computed using the sample variance $s^2$ and the $\chi^2(p, k)$-quantile of the $\chi^2$ distribution which
can be looked up in Table 28.10 on page 493.

$$\sigma^2_\gamma \in \left[\frac{(n-1)\,s^2}{\chi^2\left(1 - \frac{\alpha}{2},\, n-1\right)},\; \frac{(n-1)\,s^2}{\chi^2\left(\frac{\alpha}{2},\, n-1\right)}\right] \tag{28.273}$$

Success Probability p of a B(1,p) Binomial Distribution

The two-sided confidence interval (of probability $\gamma$) of the success probability p of a B(1,p)
binomial distribution can be computed as follows:



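$$p_\gamma \in \left[\frac{n}{n + z^2}\left(a + \frac{z^2}{2n} \pm z\sqrt{\frac{a(1-a)}{n} + \left(\frac{z}{2n}\right)^2}\right)\right] \tag{28.274}$$

Here, z abbreviates $z\left(1 - \frac{\alpha}{2}\right)$ and a is the arithmetic mean of the n samples, i. e., the observed
relative frequency of successes.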



Expected Value of an Unknown Distribution with Sample Variance 



The two-sided confidence interval (of probability $\gamma$) of the expected value EX of an unknown
distribution with an unknown real variance $D^2X$ can be determined using the arithmetic
mean a and the sample variance $s^2$ if the sample data set contains more than n = 50
elements.

$$EX_\gamma \in \left[a \pm z\left(1 - \frac{\alpha}{2}\right) \frac{s}{\sqrt{n}}\right] \tag{28.275}$$



Confidence Intervals from Tests 

Many statistical tests (such as Wilcoxon's signed rank test introduced in Section 28.8.1)
can be inverted in order to obtain confidence intervals [205]. The topic of statistical tests
is discussed in Section 28.8.



28.7.4 Density Estimation 

In this section we discuss density estimation 69 techniques [185, 1845]. Density estimation is
often used by global optimization algorithms in order to test whether a certain region of
the search space has already been explored sufficiently and where to concentrate the further
search efforts.

Definition 28.62 (Density Estimation). A density estimation $\rho(a)$ approximates an
unobservable probability density function $f_X(a)$ (see Section 28.2.8 on page 472) using a set
of sample data A.

$$\rho(a) \approx f_X(a) \tag{28.276}$$
$$\rho : A \mapsto \mathbb{R}^+ \tag{28.277}$$

Histograms 



TODO 

The k-th Nearest Neighbor Method

Definition 28.63 (k-th Nearest Neighbor Distance). The k-th nearest neighbor distance
function $\mathrm{dist}^{\rho}_{nn,k}$ denotes the distance of one element a to its k-th nearest neighbor in the set
of all elements A. It relies on a distance measure (here called dist) to compute the element
distances. See Section 29.1 on page 537 for more details on distance measures.

$$\mathrm{dist}^{\rho}_{nn,k}(\mathrm{dist}, a, A) = \mathrm{dist}(a, a_k) : \left|\{\forall b \in A : \mathrm{dist}(a, b) < \mathrm{dist}(a, a_k)\}\right| = k - 1 \tag{28.278}$$

69 http://en.wikipedia.org/wiki/Density_estimation [accessed 2007-07-03] 

Using the k-th nearest neighbor method [1879], the probability density function of an
element a is estimated by its distance to its k-th nearest neighbor in the test set A (with
$k \le |A|$). Most often, the k-th nearest neighbor distance measure internally uses the Euclidean
distance measure $\mathrm{dist}_{eucl} \equiv \mathrm{dist}_{n,2}$ (see Definition 29.8 on page 538), but theoretically any
other one of the distance measures presented in Section 29.1 could also be applied. Normally,
k is chosen to be $\sqrt{|A|}$.
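
A minimal sketch of the k-th nearest neighbor distance could look as follows. It assumes the Euclidean distance (`math.dist`) as default and excludes the element a itself from its own neighborhood, which is an assumption of this sketch and not stated in Definition 28.63. The function name and the sample points are hypothetical.

```python
import math

def kth_nearest_neighbor_distance(a, sample, k, dist=math.dist):
    """Distance from a to its k-th nearest neighbor in sample.

    Assumption of this sketch: a itself (same object) is not counted as a neighbor.
    """
    distances = sorted(dist(a, b) for b in sample if b is not a)
    return distances[k - 1]

# Hypothetical two-dimensional sample (illustration only).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 5.0)]
k = round(math.sqrt(len(points)))  # the usual choice k = sqrt(|A|), rounded
d = kth_nearest_neighbor_distance(points[0], points, k)
print(k, d)
```

A large distance indicates that the element lies in a sparsely populated region, so a density estimate would decrease as this distance grows.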

Crowding Distance 

Crowding distance [542] treats every element $a \in A$ as an n-dimensional vector (where each
dimension will represent an objective subject to optimization in the context of this book).
The crowding distance is not a distance measure as its name may suggest, but a base for a
density estimate. When computing the crowding distance of an element a we consider every
single dimension i of the element a separately. For each of its dimensions, we determine
the nearest neighbor to the left $a^l$ and the nearest neighbor to the right $a^r$. The crowding
distance of the element a in the dimension i is then $a^r_i - a^l_i$, the distance of the (objective)
values of the right and left neighbors of a in the dimension i. This distance is normalized
so that the maximum crowding distance of all elements in A in any dimension is 1. If an
element has no left or no right neighbor in this dimension, meaning that it is situated on
either end of the spectrum represented by all elements in the sample A, its crowding distance
in the dimension is also set to 1.

The original source [542] does not mention normalization explicitly and sets the crowding
distance of edge elements to $\infty$, both of which are problematic. If no normalization is performed,
dimensions with large crowding distances will outweigh those with smaller values, which then
play almost no role in the crowding density value finally computed. With normalization, each
dimension has the same weight. If the crowding distance of edge elements is set to $\infty$, they
will have a very outstanding position in A which could influence processes relying on the
crowding distance in a very strong way.

The total crowding distance of an element a is the sum of its distance values corresponding
to each dimension. Algorithm 28.1 on the following page computes a function $\mathrm{dist}^{\rho}_{cr}(a, A)$
which relates each element a of the set A to its crowding distance. In this algorithm, we
consider $\mathrm{dist}^{\rho}_{cr}$ to be some sort of lookup table instead of a mathematical function. Therefore, we
can build it iteratively by summing up the distance values dimension-wise. Since computing
the crowding distance can be performed best by sorting the individuals according to their
values in the single dimensions, we define the comparator function 70 $\mathrm{cmp}_i(a, b)$ as follows:

$$\mathrm{cmp}_i(a, b) = \begin{cases} -1 & \text{if } a_i < b_i \\ 0 & \text{if } a_i = b_i \\ 1 & \text{otherwise} \end{cases} \quad \forall a, b \in A,\; \forall i \in [0, |a|] \tag{28.280}$$

The crowding distance can now be used as a density estimate: individuals with
large crowding distance values are in a sparsely covered region while small values of $\mathrm{dist}^{\rho}_{cr}$
indicate dense portions of A. A density estimate derived from the crowding distance will
therefore decrease as the crowding distance grows. Hence, we define the density measure $\rho_{cr}$ as the
difference of 1 and $\mathrm{dist}^{\rho}_{cr}(a, A)$ divided by the vector dimensions n = |a|, obtaining a value
in [0, 1] that is big if a is in a crowded region and small if it is situated in a sparsely covered
area of A. It should be noted that this density estimate is mathematically not fully sound
since it only displays the crowding information.

$$\rho_{cr}(a, A) = 1 - \frac{\mathrm{dist}^{\rho}_{cr}(a, A)}{|a|} \tag{28.281}$$

70 Comparator functions were introduced in Definition 1.15 on page 38.

Algorithm 28.1: $\mathrm{dist}^{\rho}_{cr}(\dots, A)$ ← computeCrowdingDistance(A)
Input: A: the set of sample data
Data: dd: a list used as store for the crowding distances of the single dimensions
Data: A_s: the list representation of A
Data: dim: the dimension counter
Data: j: the element counter
Data: max: the maximum crowding distance of the current dimension
Output: $\mathrm{dist}^{\rho}_{cr}(\dots, A)$: the crowding distance function

begin
    dd ← createList(len(A), 0)
    dd[0] ← 1
    dd[len(A) - 1] ← 1
    A_s ← setToList(A)
    dim ← n
    while dim > 0 do
        A_s ← sortList_a(A_s, cmp_dim)
        max ← 0
        j ← len(A) - 2
        while j > 0 do
            dd[j] ← A_s[j + 1]_dim - A_s[j - 1]_dim
            if dd[j] > max then max ← dd[j]
            j ← j - 1
        if max > 0 then
            j ← len(A) - 2
            while j > 0 do
                dd[j] ← dd[j] / max
                j ← j - 1
        j ← len(A) - 1
        while j ≥ 0 do
            dist_cr(A_s[j], A) ← dist_cr(A_s[j], A) + dd[j]
            j ← j - 1
        dim ← dim - 1
    return dist_cr(..., A)
end
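
The procedure of Algorithm 28.1 can be sketched in Python as follows. This is an illustrative transcription under the conventions described above (edge elements receive the value 1, inner distances are normalized by the largest inner distance of the dimension), not the book's reference implementation. The function names `crowding_distances` and `crowding_density` and the sample are hypothetical.

```python
def crowding_distances(sample):
    """Return the summed, per-dimension-normalized crowding distance of each element."""
    count = len(sample)
    dims = len(sample[0])
    total = [0.0] * count
    for dim in range(dims):
        # Indices of the elements, sorted ascending by their value in this dimension.
        order = sorted(range(count), key=lambda idx: sample[idx][dim])
        dd = [0.0] * count
        dd[0] = dd[-1] = 1.0  # edge elements get the maximum value 1 instead of infinity
        for j in range(1, count - 1):
            # Distance between the right and the left neighbor in this dimension.
            dd[j] = sample[order[j + 1]][dim] - sample[order[j - 1]][dim]
        largest = max(dd[1:count - 1], default=0.0)
        if largest > 0:
            for j in range(1, count - 1):
                dd[j] /= largest
        for j in range(count):
            total[order[j]] += dd[j]
    return total

def crowding_density(sample):
    """Density measure of Equation 28.281: 1 - crowding distance / number of dimensions."""
    dims = len(sample[0])
    return [1.0 - d / dims for d in crowding_distances(sample)]

# Hypothetical two-objective sample (illustration only).
sample = [(0.0, 4.0), (1.0, 3.0), (2.0, 1.0), (4.0, 0.0)]
distances = crowding_distances(sample)
densities = crowding_density(sample)
print(distances)
print(densities)
```

Sorting index lists instead of the elements themselves keeps the result aligned with the original order of the sample, which replaces the lookup table used in the pseudocode.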









Parzen Window / Kernel Density Estimation 

Another density estimation is the Parzen [1618] window method 71, also called kernel density
estimation.

TODO 



28.8 Statistical Tests 

With statistical tests [874, 1866, 1878, 898, 1274, 478], it is possible to find out whether
an alternative hypothesis $H_1$ about the distribution(s) from a set of measured data A is
likely to be true. This is done by showing that the sampled data would very unlikely have
occurred if the opposite hypothesis, the null hypothesis $H_0$, holds. If we want to show, for
instance, that two different settings for an evolutionary algorithm will probably lead to
different solution qualities ($H_1$), we assume that the distributions of the objective values of
the solution candidates returned by them are equal ($H_0$). Then, we run the two evolutionary
algorithms multiple times and measure the outcome, i. e., obtain A. Based on A, we can
estimate the probability $\alpha$ with which the two different sets of measurements (the samples)
would have occurred if $H_0$ was true. In the case that this probability is very low, let's say
$\alpha < 5\%$, $H_0$ can be rejected (with 5% probability of making a type 1 error) and $H_1$ is likely
to hold. Otherwise, we would expect $H_0$ to hold and reject $H_1$.

71 http://en.wikipedia.org/wiki/Parzen_window [accessed 2007-07-03]

Neyman and Pearson [1522, 1523] distinguish two classes of errors 72 that can be made
when performing hypothesis tests:

Definition 28.64 (Type 1 Error). A type 1 error ($\alpha$ error, false positive) is the rejection
of a correct null hypothesis $H_0$, i. e., the acceptance of a wrong alternative hypothesis $H_1$.
Type 1 errors are made with probability $\alpha$.

Definition 28.65 (Type 2 Error). A type 2 error ($\beta$ error, false negative) is the acceptance
of a wrong null hypothesis $H_0$, i. e., the rejection of a correct alternative hypothesis
$H_1$. Type 2 errors are made with probability $\beta$.

Definition 28.66 (Power). The (statistical) power 73 of a statistical test is the probability
of rejecting a false null hypothesis $H_0$. Therefore, the power equals $1 - \beta$.
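
To make these three quantities concrete, the sketch below evaluates them exactly for a simple two-sided binomial test. The whole setup is a hypothetical illustration that does not stem from the text: n = 13 trials, a rejection region of at most c = 2 or at least n - c = 11 successes, a null hypothesis p = 0.5, and an assumed true success probability of 0.8 under the alternative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def rejection_probability(n, c, p):
    """Probability that a B(n, p) count falls into the two-sided rejection region
    {0, ..., c} or {n - c, ..., n}."""
    region = set(range(0, c + 1)) | set(range(n - c, n + 1))
    return sum(binomial_pmf(k, n, p) for k in region)

n, c = 13, 2                                # hypothetical test setup
alpha = rejection_probability(n, c, 0.5)    # type 1 error probability: H0 (p = 0.5) is true
power = rejection_probability(n, c, 0.8)    # probability of rejecting H0 if p = 0.8 holds
beta = 1.0 - power                          # type 2 error probability

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```

Enlarging the rejection region (a larger c) would increase the power but also the probability of a type 1 error, which is the usual trade-off between the two error classes.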

A few basic principles for testing should be mentioned before going more into detail: 

1. The more samples we have, the better the quality and significance of the conclusions
that we can make by testing. An arithmetic mean of the runtime of 7s is certainly more
significant when being derived from 1000 runs of a certain algorithm than from the sample
set A = {9s, 5s}.

2. The more assumptions we can make about the sampled probability distribution,
the more powerful the available tests will be.

3. Wrong assumptions, falsely carried out measurements, or other misconduct will nullify 
all results and efforts put into testing. 

In the following, we will discuss multiple methods for hypothesis testing. We can
distinguish between tests based on paired samples and those for independent populations. In
Table 28.14, we have illustrated an example for the former, where pairs of elements (a, b)
are drawn from two different populations. Table 28.15 contains two independent samples a
and b with a different number of elements ($n_a = 6 \ne n_b = 8$).

28.8.1 Non-Parametric Tests 

All previously discussed estimation or testing methods have one thing in common: we have
to know or to assume the type of distribution which drives the sampled process. When this
distribution is known, everything is fine. If we have to assume the distribution, we may
make an error. The possibility of such an error obviously renders the probabilities that we define
for the tests or confidence intervals more or less useless. Additionally, there are cases where
we either have no idea at all about the distribution in question or where it is questionable
whether one of the distributions known to us (see, for instance, Section 28.4 and Section 28.5
for reference) fits the behavior of the observed process sufficiently well.

Non-parametric statistics 74 [1878, 1866, 252] is the branch of statistics focusing on the 
group of methods that make only extremely few assumptions about the distribution from 

72 http://en.wikipedia.org/wiki/Type_I_and_type_II_errors [accessed 2008-08-15] 

73 http://en.wikipedia.org/wiki/Statistical_power [accessed 2008-08-15] 

74 http://en.wikipedia.org/wiki/Non-parametric_statistics [accessed 2008-08-15] 

Row   a    b    d = b - a   Sign   Rank |r|   Rank r
 1.   2   10       +8        +        13        13
 2.   3    4       +1        +         2         2
 3.   6   10       +4        +        10        10
 4.   4    6       +2        +         6         6
 5.   6   11       +5        +        11        11
 6.   5    6       +1        +         2         2
 7.   4   11       +7        +        12        12
 8.   9    6       -3        -         9        -9
 9.  10   12       +2        +         6         6
10.   8    8        0        =
11.   6    8       +2        +         6         6
12.   7    6       -1        -         2        -2
13.   4    4        0        =
14.   4    6       +2        +         6         6
15.   9    7       -2        -         6        -6

sum(a_i) = 87    sum(b_i) = 115    D = sum(d_i) = 28    R = sum(r_i) = 57
med(a) = 6; mean(a) = 5.8
med(b) = 7; mean(b) = 7.67

Table 28.14: Example for paired samples (a, b).


Row   a    b    Ranks r_a   Ranks r_b
 1.   2            1.5
 2.        2                   1.5
 3.   3            4.0
 4.   3            4.0
 5.   3            4.0
 6.   4            6.0
 7.        5                   8.5
 8.   5            8.5
 9.        5                   8.5
10.        5                   8.5
11.        6                  11.5
12.        6                  11.5
13.        7                  13.5
14.        7                  13.5

med(a) = 3    med(b) = 5.5    R_a = 28    R_b = 77
n_a = 6       n_b = 8

Table 28.15: Example for unpaired samples.



which the data has been sampled. Many of the density estimation methods discussed
in Section 28.7.4 belong to this group, for instance. Here, we will concentrate on
non-parametric tests which allow us to verify hypotheses on data samples with unknown
underlying distribution.

Sign Test 

The sign test 75 [874, 1878] is used for checking whether the differences in the medians of
paired samples from continuous distributions are significant. This test is especially useful,
for instance, when we have before-after or with-and-without-types of sample pairs (a, b) and
can (only) measure the changes between them. The null hypothesis $H_0$ is that there is no
difference between the medians of the distributions generating the elements a and b. The
alternative hypothesis $H_1$ is that such a difference exists.

An example for this situation has been given in Table 28.14 on the facing page. The first
step of applying the sign test is to reduce the measurement pairs (a, b) to + (if a < b), = (if
a = b), and to - (if a > b), as done in the fifth column of Table 28.14. Then, the numbers of
+ and - in the resulting set A' are counted.

$$n^+ = \left|\{a \in A' : a = +\}\right| \tag{28.282}$$
$$n^- = \left|\{a \in A' : a = -\}\right| \tag{28.283}$$

In Table 28.14, $n^+ = 10$ and $n^- = 3$. In the following, the samples with = are ignored,
setting the total of "interesting" samples to n = 13. The motivation is that if the underlying
distributions are continuous, the chance of drawing two equal elements $a_i = b_i$ (with
difference $d_i = 0$) is also 0 and such measurements thus result from imprecision. On one
hand, this makes complete sense, since these samples would have been either + or - with
more precise measurement equipment and now we cannot determine to which group they
belong anymore. On the other hand, by simply discarding these samples, we also discard
information which supports the null hypothesis $H_0$. This is a weakness of the sign test.

In the ideal case, if $H_0$ holds, i. e., if the medians of the distributions of a and b are
equal, the probability that a row in Table 28.14 contains a + is exactly the same as the
probability that it contains a -: both are 0.5. One will rarely encounter such an ideal situation in the
data samples, so we need to find out how significant the 10 : 3 ratio in our sample is. With
the binomial distribution (discussed in Section 28.4.3), we can determine the probability
that $n^+$ (resp. $n^-$) or more extreme numbers of + (-) would occur under $H_0$.

$$\alpha = P(x \le \min\{n^+, n^-\}) + P(x \ge \max\{n^+, n^-\})$$
$$= 2P(x \le \min\{n^+, n^-\}) = 2 \sum_{i=0}^{\min\{n^+, n^-\}} \binom{n}{i} 0.5^n$$
$$= 2P(x \ge \max\{n^+, n^-\}) = 2 \sum_{i=\max\{n^+, n^-\}}^{n} \binom{n}{i} 0.5^n \tag{28.284}$$

Notice that this corresponds to computing a two-sided probability with the CDF of the
binomial distribution (see Equation 28.137) with the parameters n and p = 0.5. In our
example, we would compute:

$$\alpha = 2 \sum_{i=0}^{3} \binom{13}{i} 0.5^{13} \approx 0.0923 \tag{28.285}$$

Data samples at least as extreme as our measurements could occur with a probability of
approximately 9%. At a significance level of 5%, we cannot reject $H_0$. Hence, there is not
enough information to believe in $H_1$ according to the sign test.
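
The sign test is short enough to be written down completely. The following sketch applies Equation 28.284 to the paired samples of Table 28.14. The function name `sign_test` is a hypothetical helper introduced for this illustration.

```python
from math import comb

def sign_test(pairs):
    """Two-sided sign test; returns (n_plus, n_minus, alpha) for a list of pairs (a, b)."""
    n_plus = sum(1 for a, b in pairs if a < b)
    n_minus = sum(1 for a, b in pairs if a > b)
    n = n_plus + n_minus            # pairs with a = b are discarded
    k = min(n_plus, n_minus)
    # Twice the lower tail of the binomial distribution B(n, 0.5), capped at 1.
    alpha = 2.0 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return n_plus, n_minus, min(alpha, 1.0)

# Paired samples from Table 28.14.
a = [2, 3, 6, 4, 6, 5, 4, 9, 10, 8, 6, 7, 4, 4, 9]
b = [10, 4, 10, 6, 11, 6, 11, 6, 12, 8, 8, 6, 4, 6, 7]

n_plus, n_minus, alpha = sign_test(list(zip(a, b)))
print(n_plus, n_minus, round(alpha, 4))  # 10 3 0.0923
```

The cap at 1 only matters when the numbers of + and - are nearly balanced, in which case doubling the lower tail can exceed 1.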



Randomization Test 

Randomization tests 76 for equality of expected values have first been suggested by Fisher
[683] in 1936 [252, 619, 1977]. They not only take into consideration the signs of the
differences, but also the sum D of the differences themselves. In our example from Table 28.14, the
total difference D between the a and b is 28.

The null hypothesis $H_0$ is again that all samples have been drawn from the same population
and thus, their expected values are the same. For each single pair $(a_i, b_i)$ in Table 28.14,
we have computed the difference $d_i = b_i - a_i$ in the fourth column. If $H_0$ holds, the
probability that we would measure $d_i$ is exactly the same as for measuring $-d_i$, since drawing
the pair $(a_i, b_i)$ is exactly as probable as drawing $(b_i, a_i)$.

Leaving the zero differences ($d_i = 0$) out of consideration, we obtain the following
combinatorial considerations from [252]: If our measured data consists of only one pair $(a_1, b_1)$,
under $H_0$, $2^1 = 2$ differences are possible ($d_1 = b_1 - a_1$ or $d_1 = a_1 - b_1$) and both have
the same probability 1/2. For n = 2 pairs, the difference signs can occur in $2^2 = 4$ ways,
{--, -+, +-, ++}, each having probability 1/4. For n = 3, $2^3 = 8$ possible sign configurations
with probability 1/8 can emerge: {---, --+, -+-, -++, +--, +-+, ++-, +++}.

Generally, if we leave the pairs intact and exchange only their members, there are $2^n$ possible
+/- arrangements for n pairs.

For all these arrangements, we compute the absolute value of the corresponding total
difference D' and count the number $n_e$ of differences that are at least as extreme as the absolute
value of D. "Extreme" means either larger or smaller than D, depending on the sample
distribution. The absolute values of the differences are used since the test of $H_0$ is basically
a two-sided test. The probability $\alpha$ of the observed measurements under $H_0$ can then be
estimated with

$$n_e = \min\left\{\left|\{D' : D' \ge |D|\}\right|, \left|\{D' : D' \le |D|\}\right|\right\} \tag{28.286}$$
$$\alpha = \frac{n_e}{2^n} \tag{28.287}$$

In other words, the more often differences at least as extreme as the initial D occur, the more
evidence is given for $H_0$. If, on the other hand, only very few configurations with differences
as extreme as D exist, $\alpha$ becomes very small and we can reject $H_0$.
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Possible Difference of a Permutation 
ure 28.19: The randomization test applied to the example from Table 28.14. 



Let us apply this procedure to the example given in Table 28.14. There are n = 13 pairs
with non-zero difference in our samples, leading to a total of 2^13 = 8192 configurations. You can
find these 8192 values illustrated in a histogram in Figure 28.19. From this histogram, we
can see that there are 344 permutations with a difference at least as extreme as the sampled
difference. Hence, α = 344/8192 = 43/1024 ≈ 0.042 = 4.2% and, at a significance level of 5%, the null
hypothesis H_0 can be rejected.

The randomization test has a higher power than the sign test, but comes with the
additional assumptions that the measured sample represents the population(s) and the
underlying distribution(s) sufficiently well and that the samples are pair-wise independent.
The populations do not necessarily need to be homogeneous. Unlike the sign test,
the randomization test does not require the sampled distributions to be continuous and,
thus, zero differences are possible. They would play no role in the computation since the
same results would come out with and without them [252]. Leaving them out makes sense
since it reduces the number of combinations which have to be tested. If we expect a high
probability of outliers, the randomization test may not be the method of choice and we
would prefer the sign test. Notice furthermore that computing all possible differences may
become computationally intense for n > 128 data samples...
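
The randomization test described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `randomization_test` is hypothetical, the input is the list of in-pair differences d_i, and the sample values are made up (they are not the data of Table 28.14). The sketch enumerates all 2^n sign configurations and applies Equations 28.286 and 28.287 literally.

```python
from itertools import product


def randomization_test(diffs):
    """Exact two-sided randomization test on paired differences.

    Returns (n_e, alpha) following Equations 28.286 and 28.287.
    """
    # Zero differences do not change the outcome, so drop them first.
    d = [x for x in diffs if x != 0]
    n = len(d)
    observed = abs(sum(d))  # |D| of the measured sample

    at_least = 0  # configurations with |D'| >= |D|
    at_most = 0   # configurations with |D'| <= |D|
    # Swapping the members of a pair flips the sign of its difference,
    # so all 2^n sign patterns are enumerated.
    for signs in product((1, -1), repeat=n):
        total = abs(sum(s * x for s, x in zip(signs, d)))
        if total >= observed:
            at_least += 1
        if total <= observed:
            at_most += 1

    n_e = min(at_least, at_most)
    return n_e, n_e / 2 ** n


# Hypothetical differences, NOT the data of Table 28.14.
n_e, alpha = randomization_test([1, 2, 3, 0])
print(n_e, alpha)
```

Since the loop runs 2^n times, such a direct enumeration is only practical for small n, which matches the remark on computational cost above.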

Signed Rank Test 

Wilcoxon's signed rank test [2220] basically works exactly the same as the randomization
test except that it replaces the difference sums with difference ranks. The null hypothesis
H_0 is that the average ranks of the samples in the pairs are equal. The alternative hypothesis
H_1 is that there is a difference. The ranks are computed as follows:

1. First, the differences d_i between the elements b_i and a_i of the sample pairs (a_i, b_i) have
to be determined (the fourth column in Table 28.14).

2. The zero differences (d_i = 0) are discarded and only the remaining n samples are
considered in the test.

3. The absolute values |d_i| of these differences are sorted in ascending order.

4. Each absolute value |d_i| is assigned a rank |r_i| corresponding to its position in this list.
Rows i, i+1, ..., i+m with equal absolute differences |d_i| = |d_{i+1}| = ... = |d_{i+m}| share
the same absolute rank |r_i| = |r_{i+1}| = ... = |r_{i+m}| = (i + (i+1) + ... + (i+m)) / (m+1) = i + m/2, which
is determined by averaging, so fractional ranks such as 3.5 are possible. The difference 1,
for instance, occurs three times in the second to last (unsorted) column of Table 28.14:
|d_2| = |d_6| = |d_12| = 1. All three rows received the same absolute rank |r_2| = |r_6| =
|r_12| = (1+2+3)/3 = 2.

5. The sign that has been stripped from the differences d_i is re-attached to the ranks, i.e.,
r_i = |r_i| * sign(d_i). We have applied this ranking scheme to the example in Table 28.14
in the last table column.
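
The ranking steps above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `signed_ranks` is hypothetical, it starts from already computed differences d_i (step 1), and the sample values are made up rather than taken from Table 28.14.

```python
def signed_ranks(diffs):
    """Signed ranks with averaged ranks for ties (steps 2 to 5 above)."""
    # Step 2: discard zero differences.
    d = [x for x in diffs if x != 0]
    # Step 3: indices ordered by ascending absolute difference.
    order = sorted(range(len(d)), key=lambda k: abs(d[k]))

    ranks = [0.0] * len(d)
    pos = 0
    while pos < len(order):
        # Step 4: find the run of equal absolute values starting at pos.
        end = pos
        while end + 1 < len(order) and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        # Positions are 1-based, so the run covers pos+1 .. end+1;
        # the average of consecutive integers is the mean of both ends.
        avg = ((pos + 1) + (end + 1)) / 2
        for k in order[pos:end + 1]:
            # Step 5: re-attach the sign of the difference.
            ranks[k] = avg if d[k] > 0 else -avg
        pos = end + 1
    return ranks


# Hypothetical differences, NOT the data of Table 28.14.
print(signed_ranks([1, -1, 1, 2, 0, -3]))
```

In this made-up sample the absolute difference 1 occurs three times and occupies positions 1 to 3, so all three entries receive the averaged rank 2, mirroring the tie handling described in step 4.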

The rank sum R of the initial sample is determined by adding up all the signed ranks
r_i of the in-pair differences. Now the distribution of the absolute values of the 2^n possible
rank sums R' is computed with the same method with which the distribution of possible
difference sums is determined in the randomization test (see Section 28.8.1). Amongst these,
we count the number n_e of rank sums at least as extreme as R, and the probability α that
the ranked samples would have been measured is then determined with Equation 28.289.

n_e = \min\{ |\{R' : R' \geq |R|\}| , |\{R' : R' \leq |R|\}| \}    (28.288)
\alpha = \frac{n_e}{2^n}    (28.289)

We apply this procedure to our example from Table 28.14 and illustrate the histogram
of the absolute rank sums in Figure 28.20 (analogously to the histogram of possible
absolute difference sums in Figure 28.19). The rank sum of the original sample is 57, as
noted in Table 28.14. In our example, we have n = 13 non-zero differences and there are a
total of 2^n = 8192 possible configurations. Amongst these, absolute rank sums of 57 or higher
occur n_e = 372 times, leading to α = 372/8192 = 93/2048 ≈ 0.045. Therefore, the set of ranks which we
have measured would have a probability of 4.5% if H_0 held. In other words, at a
significance level of 5%, we can reject H_0 and assume the alternative hypothesis H_1.

Figure 28.20: The histogram of possible absolute rank sums in Table 28.14.



By neither weighting every difference the same nor giving outliers too big a chance
to influence the outcome of the test, the signed rank test lies somewhere between the sign
test and the randomization test [252]. When applying the signed rank test, we do not need to
assume that the sampled data represent the true distributions with high precision. At first
glance, this test has the same drawback as the randomization test: in order to get to
α = 93/2048, we need to compute all the 2^n = 8192 possible rank combinations (or, at least,
half of them). For larger n, this is no longer feasible: n = 32, for instance, requires 4 294 967 296
iterations, and for n = 100, we would have to check about 1.3 * 10^30 combinations.
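
One way around the exponential enumeration, sketched here as an assumption and not taken from the text, is a standard subset-sum dynamic program. It only applies when all ranks are untied, i.e., when they are exactly 1, 2, ..., n. The function names are hypothetical. The idea is that a signed rank sum R' is at least |R| exactly when the ranks carrying a negative sign add up to at most (n(n+1)/2 - |R|)/2, so it suffices to count subsets of {1, ..., n} with a bounded sum.

```python
def count_rank_sums_at_most(n, limit):
    """Count the subsets of the ranks {1, ..., n} whose sum is <= limit."""
    if limit < 0:
        return 0
    # ways[s] = number of subsets seen so far whose ranks add up to s.
    ways = [0] * (limit + 1)
    ways[0] = 1  # the empty subset
    for rank in range(1, n + 1):
        # Iterate downwards so that each rank is used at most once.
        for s in range(limit, rank - 1, -1):
            ways[s] += ways[s - rank]
    return sum(ways)


def two_sided_count(n, rank_sum):
    """Return (n_e, 2^n) for a signed rank sum R, assuming untied ranks 1..n."""
    total = n * (n + 1) // 2
    # R' >= |R| holds exactly when the negative ranks sum to at most
    # (total - |R|) / 2; the mirrored case R' <= -|R| is equally frequent.
    limit = (total - abs(rank_sum)) // 2
    n_e = 2 * count_rank_sums_at_most(n, limit)
    # For R = 0 both cases overlap, so cap at the number of configurations.
    return min(n_e, 2 ** n), 2 ** n


n_e, configs = two_sided_count(13, 57)
print(n_e, configs, n_e / configs)
```

The work grows with n times the limit instead of 2^n, so n = 100 poses no problem. For n = 13 and R = 57 the sketch counts 392 configurations rather than the 372 found by enumeration above, because the example data contain tied ranks, which this untied-rank shortcut does not model.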

Different from the randomization test though, the rank numbers are finite and we know,
for instance, their maximum value f = [...]. For the signed rank approach, there exist
tables (such as Table 28.16), where the corresponding α values are listed. In order to use
them, we use another (equivalent) approach for computing the rank sums [1076]. First, we
proceed exactly as before until we have computed all absolute ranks |r_i| of the absolute
differences (see the sixth column in Table 28.14). Similar to the sign test, we compute
the sum of ranks r^+ belonging to the positive and the sum of ranks r^- belonging to the
negative differences.

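r^{+} = \sum_{i=1}^{n} \begin{cases} |r_i| & \text{if } d_i > 0 \\ 0 & \text{otherwise} \end{cases}    (28.290)

r^{-} = \sum_{i=1}^{n} \begin{cases} |r_i| & \text{if } d_i < 0 \\ 0 & \text{otherwise} \end{cases}    (28.291)

f = \min\{r^{+}, r^{-}\}    (28.292)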
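
Equations 28.290 to 28.292 translate almost directly into code. The following is a small sketch with a hypothetical function name and made-up signed ranks (not those of Table 28.14). It uses the sign of the signed rank r_i, which by construction equals the sign of the difference d_i.

```python
def rank_statistics(signed):
    """Return (r_plus, r_minus, f) for a list of signed ranks."""
    r_plus = sum(r for r in signed if r > 0)    # Equation 28.290
    r_minus = sum(-r for r in signed if r < 0)  # Equation 28.291
    return r_plus, r_minus, min(r_plus, r_minus)  # Equation 28.292


# Hypothetical signed ranks, NOT the last column of Table 28.14.
signed = [2.0, -2.0, 2.0, 4.0, -5.0]
r_plus, r_minus, f = rank_statistics(signed)
n = len(signed)
print(r_plus, r_minus, f)
# The two identities used in the text: r+ - r- = R and r+ + r- = n(n+1)/2.
print(r_plus - r_minus == sum(signed), r_plus + r_minus == n * (n + 1) / 2)
```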

In our example, r^+ = 74, r^- = 17 (notice that r^+ - r^- = R = 57 and r^+ + r^- = 0.5 * n *
(n + 1) = 91), and, thus, f = 17. In the following, three tables with the distribution of the
Wilcoxon rank values for two-sided hypotheses can be found. The first two (Table 28.16 and
Table 28.17) contain the critical values of f for certain α whereas the third table (Table 28.18)
lists the exact α values for various n and f. When we want to find the significance level with
which we can reject H_0, we look up the row with n = 13 in Table 28.16 and search for the first
cell which is greater than or equal to f. We find that f ≤ 17 implies α ≤ 0.05. (If f was 16, 15, ..., 13,
we would assume the same α, and for f = 12, α ≤ 0.02 would hold.) We had determined the precise
α in our case to be 4.5%, which is consistent with the table value. According to Table 28.16, we could
reject H_0 at a significance level of 5%. Notice that a cell of Table 28.16 or Table 28.17 that holds
a placeholder symbol instead of a critical value means that no value of f exists for which the
given significance level is fulfilled.

Table 28.18 lists the precise values of α for n ∈ 4..30. We can again look up our example
by first finding the section dealing with n = 13. There, we can find the α corresponding to f.
In each row of this section, ten values of f are listed. The first cell of row X stands for f = X,
the second one for f = X + 1, and so on. Since f = 17, we go to the eighth column of the
second row (the row that starts with 10) and find α = 0.04786. This value is equal to 392/8192,
whereas the exact value that we have computed is smaller (372/8192). This difference results
from the fact that in our example, there are some sample differences which share the same
rank |r| (rows 9 and 11 in Table 28.14, for example). The table is only precise for unique ranks.
Basically, shared ranks can lead to increases as well as decreases of α. For example, if all ranks
were 7, there would be only 184 combinations for n = 13 where the ranks are at least as extreme
as 17 (α would be 2.2%), and if the 13 ranks were (3, 3, 3, 3, 3, 7, 7, 7, 11, 11, 11, 11, 11), there
would be 416 combinations, i.e., α = 0.0507. Thus, the tables can only be used correctly if
most rank numbers are indeed unique.
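
The three counts quoted above (392, 184, and 416) can be checked by brute force. The sketch below is only an illustration with a hypothetical function name. It enumerates all 2^13 sign configurations for a given list of absolute ranks and counts those whose smaller rank sum min(r^+, r^-) does not exceed f = 17.

```python
from itertools import product


def count_at_least_as_extreme(abs_ranks, f):
    """Count sign configurations whose smaller rank sum min(r+, r-) is <= f."""
    total = sum(abs_ranks)
    count = 0
    for negative in product((False, True), repeat=len(abs_ranks)):
        # r_minus collects the ranks whose difference is taken as negative.
        r_minus = sum(r for r, neg in zip(abs_ranks, negative) if neg)
        if min(r_minus, total - r_minus) <= f:
            count += 1
    return count


unique = list(range(1, 14))              # untied ranks 1..13
all_sevens = [7] * 13                    # every rank shared
grouped = [3] * 5 + [7] * 3 + [11] * 5   # three groups of shared ranks
for ranks in (unique, all_sevens, grouped):
    n_e = count_at_least_as_extreme(ranks, 17)
    print(n_e, n_e / 2 ** 13)
```

All three rank lists sum to 91, so only the way the ranks are shared differs, which is exactly what moves α below or above the tabulated value.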
Table 28.17: Wilcoxon's two-sided signed-rank distribution 2W(α, n) for n ∈ 101..200 (from Junge [1076]).

n \ α     .2     .1    .05    .02    .01   .005   .002
  143  [...]   4331   4175   3996   3874   3761   3624
  144   4575   4394   4237   4055   3932   3819   3680
  145   4641   4458   4299   4116   3991   3877   3737
  146   4708   4522   4362   4177   4051   3935   3793
  147   4774   4587   4425   4238   4111   3994   3851
  148   4842   4652   4489   4300   4171   4053   3908
  149   4909   4718   4553   4362   4232   4113   3967
  150   4978   4785   4618   4425   4294   4173   4025
  151   5046   4851   4683   4488   4355   4233   4084
  152   5115   4919   4748   4551   4418   4294   4144
  153   5185   4986   4814   4615   4480   4356   4204
  154   5255   5054   4881   4680   4544   4418   4264
  155   5326   5123   4948   4745   4607   4480   4325
  156   5397   5192   5015   4810   4671   4543   4387
  157   5468   5262   5083   4876   4736   4606   4448
  158   5540   5332   5151   4943   4801   4670   4511
  159   5613   5402   5220   5009   4866   4734   4573
  160   5686   5473   5289   5077   4932   4799   4636
  161   5759   5545   5359   5144   4999   4864   4700
  162   5833   5617   5429   5213   5065   4930   4764
  163   5908   5689   5500   5281   5133   4996   4828
  164   5982   5762   5571   5350   5200   5062   4893
  165   6058   5835   5643   5420   5269   5129   4959
  166   6134   5909   5715   5490   5337   5196   5024
  167   6210   5983   5787   5560   5406   5264   5091
  168   6287   6058   5860   5631   5476   5332   5157
  169   6364   6133   5934   5703   5546   5401   5224
  170   6442   6209   6008   5775   5616   5470   5292
  171   6520   6285   6082   5847   5687   5540   5360
  172   6599   6362   6157   5920   5759   5610   5429
  173   6678   6439   6232   5993   5831   5681   5497
  174   6758   6517   6308   6067   5903   5752   5567
  175   6838   6595   6385   6141   5976   5823   5637
  176   6919   6673   6461   6216   6049   5895   5707
  177   7000   6752   6538   6291   6123   5967   5778
  178   7081   6832   6616   6366   6197   6040   5849
  179   7163   6912   6694   6442   6271   6113   5921
  180   7246   6992   6773   6519   6346   6187   5993
  181   7329   7073   6852   6596   6422   6261   6065
  182   7412   7155   6932   6673   6498   6336   6138
  183   7496   7236   7012   6751   6574   6411   6212
  184   7581   7319   7092   6829   6651   6486   6285
  185   7666   7402   7173   6908   6728   6562   6360
  186   7751   7485   7254   6987   6806   6639   6435
  187   7837   7569   7336   7067   6884   6716   6510
  188   7924   7653   7419   7147   6963   6793   6585
  189   8010   7738   7502   7228   7042   6871   6662
  190   8098   7823   7585   7309   7122   6949   6738
  191   8186   7908   7669   7391   7202   7028   6815
  192   8274   7995   7753   7473   7283   7107   6893
  193   8363   8081   7838   7555   7364   7187   6971
  194   8452   8168   7923   7638   7445   7267   7049
  195   8542   8256   8008   7722   7527   7347   7128
  196   8632   8344   8095   7806   7610   7428   7207
  197   8723   8432   8181   7890   7692   7510   7287
  198   8814   8521   8268   7975   7776   7592   7367
  199   8905   8611   8356   8060   7859   7674   7448
  200   8998   8701   8444   8146   7944   7757   7529
  f  +0         +1         +2         +3         +4         +5         +6         +7         +8         +9

Precise α values for n = 4
  0  0.125

Precise α values for n = 5
  0  0.0625     0.125      0.1875

Precise α values for n = 6
  0  0.03126    0.0625     0.09376    0.15626

Precise α values for n = 7
  0  0.015626   0.03126    0.04688    0.07812    0.10938    0.15626

Precise α values for n = 8
  0  0.007812   0.015626   0.02344    0.03906    0.05468    0.07812    0.10938    0.14844    0.19532

Precise α values for n = 9
  0  0.003906   0.007812   0.011718   0.019532   0.02734    0.03906    0.05468    0.07422    0.09766    0.1289
 10  0.16406

Precise α values for n = 10
  0  0.0019532  0.003906   0.00586    0.009766   0.013672   0.019532   0.02734    0.0371     0.04882    0.06446
 10  0.08398    0.10546    0.13086    0.16016    0.19336

Precise α values for n = 11
  0  9.766E-4   0.0019532  0.00293    0.004882   0.006836   0.009766   0.013672   0.018554   0.02442    0.03222
 10  0.042      0.05372    0.06738    0.083      0.10156    0.12304    0.14746    0.1748

Precise α values for n = 12
  0  4.882E-4   9.766E-4   0.0014648  0.002442   0.003418   0.004882   0.006836   0.009278   0.012208   0.016114
 10  0.021      0.02686    0.03418    0.04248    0.05224    0.06396    0.07714    0.09228    0.10986    0.1294
 20  0.15136    0.17626

Precise α values for n = 13
  0  2.442E-4   4.882E-4   7.324E-4   0.0012208  0.001709   0.002442   0.003418   0.004638   0.006104   0.008056
 10  0.010498   0.013428   0.01709    0.02148    0.02662    0.03272    0.0398     0.04786    0.05738    0.06812
 20  0.08032    0.09424    0.10986    0.1272     0.14648    0.16772    0.19092

Precise α values for n = 14
  0  1.2208E-4  2.442E-4   3.662E-4   6.104E-4   8.544E-4   0.0012208  0.001709   0.00232    0.003052   0.004028
 10  0.00525    0.006714   0.008544   0.010742   0.013428   0.016602   0.02026    0.02454    0.02954    0.03528
 20  0.04188    0.04944    0.05798    0.06762    0.0785     0.09058    0.104      0.1189     0.13526    0.15308
 30  0.1726     0.19372

Precise α values for n = 15
  0  6.104E-5   1.2208E-4  1.831E-4   3.052E-4   4.272E-4   6.104E-4   8.544E-4   0.0011596  0.0015258  0.002014
 10  0.002624   0.003356   0.004272   0.005372   0.006714   0.008362   0.010254   0.012452   0.015076   0.018066
 20  0.02154    0.02558    0.03016    0.03534    0.04126    0.04792    0.05536    0.06372    0.073      0.08326
 30  0.0946     0.107      0.12054    0.13538    0.15142    0.16882    0.18762

Precise α values for n = 16
  0  3.052E-5   6.104E-5   9.156E-5   1.5258E-4  2.136E-4   3.052E-4   4.272E-4   5.798E-4   7.63E-4    0.001007
 10  0.0013122  0.0016784  0.002136   0.002686   0.003356   0.00418    0.005158   0.006286   0.00763    0.009186
 20  0.010986   0.013092   0.015502   0.01825    0.0214     0.02496    0.029      0.03354    0.03864    0.04432
 30  0.05066    0.05768    0.0654     0.07392    0.08326    0.09344    0.10458    0.11666    0.12974    0.14386
 40  0.15906    0.17536    0.19282

Precise α values for n = 17
  0  1.5258E-5  3.052E-5   4.578E-5   7.63E-5    1.0682E-4  1.5258E-4  2.136E-4   2.9E-4     3.814E-4   5.036E-4
 10  6.562E-4   8.392E-4   0.0010682  0.0013428  0.0016784  0.00209    0.002578   0.003158   0.003846   0.004638
 20  0.00557    0.006652   0.007904   0.009338   0.010986   0.012864   0.015      0.017426   0.02016    0.02322
 30  0.02668    0.03052    0.0348     0.03954    0.04476    0.05054    0.05688    0.06382    0.07142    0.07968
 40  0.08866    0.09838    0.10888    0.1202     0.13236    0.14544    0.15938    0.17426    0.1901

Precise α values for n = 18
  0  7.63E-6    1.5258E-5  2.288E-5   3.814E-5   5.34E-5    7.63E-5    1.0682E-4  1.4496E-4  1.9074E-4  2.518E-4
 10  3.28E-4    4.196E-4   5.34E-4    6.714E-4   8.392E-4   0.0010452  0.0012894  0.0015792  0.0019302  0.002334
 20  0.002808   0.003364   0.004006   0.004746   0.0056     0.006576   0.00769    0.008964   0.010406   0.012032
 30  0.01387    0.01593    0.018234   0.02082    0.02368    0.02684    0.03036    0.03424    0.0385     0.04316
 40  0.04828    0.05386    0.05994    0.06654    0.07368    0.08142    0.08976    0.09874    0.10838    0.1187
 50  0.12974    0.14152    0.15404    0.16736    0.18146    0.19638

Precise α values for n = 19
  0  3.814E-6   7.63E-6    1.1444E-5  1.9074E-5  2.67E-5    3.814E-5   5.34E-5    7.248E-5   9.536E-5   1.2588E-4
 10  1.6404E-4  2.098E-4   2.67E-4    3.356E-4   4.196E-4   5.226E-4   6.446E-4   7.896E-4   9.652E-4   0.0011712
 20  0.0014114  0.0016938  0.002022   0.0024     0.002838   0.003342   0.003918   0.004578   0.00533    0.00618
 30  0.007144   0.008232   0.009452   0.010826   0.01236    0.014068   0.015972   0.018082   0.02042    0.02298
 40  0.02582    0.02894    0.03234    0.03606    0.04014    0.04456    0.04936    0.05458    0.0602     0.06628
 50  0.07284    0.07988    0.08742    0.09552    0.10416    0.11338    0.12318    0.13362    0.14468    0.1564
 60  0.1688     0.18186    0.19564

Precise α values for n = 20
  0  1.9074E-6  3.814E-6   5.722E-6   9.536E-6   1.3352E-5  1.9074E-5  2.67E-5    3.624E-5   4.768E-5   6.294E-5
 10  8.202E-5   1.049E-4   1.3352E-4  1.6784E-4  2.098E-4   2.614E-4   3.224E-4   3.948E-4   4.826E-4   5.856E-4
 20  7.076E-4   8.506E-4   0.0010166  0.0012092  0.0014324  0.00169    0.0019856  0.002326   0.002712   0.003152
 30  0.003654   0.00422    0.00486    0.00558    0.00639    0.007296   0.008308   0.009436   0.010688   0.01208
 40  0.013616   0.015312   0.017182   0.019234   0.02148    0.02396    0.02664    0.02958    0.03276    0.03624
 50  0.03998    0.04406    0.04844    0.05316    0.05826    0.06372    0.06958    0.07586    0.08256    0.0897
 60  0.0973     0.1054     0.11398    0.1231     0.13272    0.1429     0.15364    0.16496    0.17686    0.18934

Precise α values for n = 21
  0  9.536E-7   1.9074E-6  2.862E-6   4.768E-6   6.676E-6   9.536E-6   1.3352E-5  1.812E-5   2.384E-5   3.148E-5
 10  4.1E-5     5.246E-5   6.676E-5   8.392E-5   1.049E-4   1.3066E-4  1.6118E-4  1.9742E-4  2.412E-4   2.928E-4
 20  3.538E-4   4.262E-4   5.102E-4   6.074E-4   7.21E-4    8.516E-4   0.0010024  0.0011758  0.0013742  0.0016002
 30  0.0018588  0.002152   0.002482   0.002858   0.003278   0.003752   0.004284   0.004878   0.005542   0.00628
 40  0.007102   0.00801    0.009016   0.010126   0.011346   0.012692   0.014166   0.01578    0.017546   0.019474
 50  0.02158    0.02386    0.02634    0.02902    0.03192    0.03506    0.03844    0.04208    0.046      0.0502
 60  0.0547     0.0595     0.06464    0.07014    0.07598    0.0822     0.0888     0.0958     0.10322    0.11106
 70  0.11934    0.12808    0.13728    0.14696    0.15714    0.1678     0.17898    0.19068

Precise α values for n = 22
  0  4.768E-7   9.536E-7   1.4306E-6  2.384E-6   3.338E-6   4.768E-6   6.676E-6   9.06E-6    1.192E-5   1.5736E-5
 10  2.05E-5    2.622E-5   3.338E-5   4.196E-5   5.246E-5   6.532E-5   8.058E-5   9.87E-5    1.2064E-4  1.4638E-4
 20  1.769E-4   2.132E-4   2.556E-4   3.046E-4   3.62E-4    4.282E-4   5.044E-4   5.928E-4   6.938E-4   8.092E-4
 30  9.412E-4   0.0010914  0.0012618  0.0014548  0.0016728  0.0019184  0.002194   0.002504   0.002852   0.00324
 40  0.003672   0.004152   0.004684   0.005276   0.005928   0.00665    0.007444   0.008316   0.009274   0.010324
 50  0.011472   0.012728   0.014094   0.015584   0.0172     0.018956   0.02086    0.02292    0.02514    0.02754
 60  0.03012    0.0329     0.03588    0.03908    0.0425     0.04616    0.05008    0.05424    0.0587     0.06342
 70  0.06844    0.07378    0.07942    0.0854     0.09174    0.09842    0.10546    0.11288    0.12068    0.12888
 80  0.13748    0.14652    0.15598    0.16586    0.1762     0.18698    0.19822

Precise α values for n = 23
  0  2.384E-7   4.768E-7   7.152E-7   1.192E-6   1.669E-6   2.384E-6   3.338E-6   4.53E-6    5.96E-6    7.868E-6
 10  1.0252E-5  1.3114E-5  1.669E-5   2.098E-5   2.622E-5   3.266E-5   4.03E-5    4.936E-5   6.032E-5   7.32E-5
 20  8.846E-5   1.0658E-4  1.278E-4   1.5258E-4  1.8144E-4  2.148E-4   2.534E-4   2.98E-4    3.492E-4   4.08E-4
 30  4.752E-4   5.518E-4   6.388E-4   7.376E-4   8.494E-4   9.758E-4   0.0011184  0.0012786  0.0014584  0.0016598
 40  0.001885   0.002136   0.002416   0.002726   0.00307    0.003452   0.003874   0.004338   0.004852   0.005414
 50  0.006032   0.00671    0.007452   0.008262   0.009146   0.01011    0.011156   0.012294   0.013528   0.014866
 60  0.016312   0.017872   0.019558   0.02138    0.02332    0.02542    0.02768    0.03008    0.03266    0.03544
 70  0.03838    0.04152    0.04488    0.04844    0.05222    0.05626    0.06052    0.06504    0.06982    0.07486
 80  0.0802     0.08582    0.09176    0.09798    0.10454    0.11142    0.11864    0.12622    0.13414    0.14242
 90  0.15108    0.1601     0.16952    0.17934    0.18954

Precise α values for n = 24
  0  1.192E-7   2.384E-7   3.576E-7   5.96E-7    8.344E-7   1.192E-6   1.669E-6   2.264E-6   2.98E-6    3.934E-6
 10  5.126E-6   6.556E-6   8.344E-6   1.049E-5   1.3114E-5  1.6332E-5  2.014E-5   2.468E-5   3.016E-5   3.66E-5
 20  4.422E-5   5.328E-5   6.39E-5    7.63E-5    9.084E-5   1.0764E-4  1.2708E-4  1.496E-4   1.7548E-4  2.052E-4
 30  2.392E-4   2.782E-4   3.224E-4   3.728E-4   4.298E-4   4.944E-4   5.676E-4   6.498E-4   7.424E-4   8.462E-4
 40  9.626E-4   0.0010926  0.001238   0.0013998  0.0015796  0.0017796  0.0020     0.002246   0.002516   0.002814
 50  0.003144   0.003504   0.0039     0.004336   0.00481    0.00533    0.005898   0.006516   0.00719    0.00792
 60  0.008714   0.009576   0.010508   0.011516   0.012604   0.01378    0.015044   0.016406   0.01787    0.019442
 70  0.02112    0.02294    0.02486    0.02692    0.02914    0.03148    0.03398    0.03664    0.03948    0.04248
 80  0.04568    0.04906    0.05264    0.05642    0.06042    0.06464    0.0691     0.0738     0.07872    0.08392
 90  0.08938    0.0951     0.1011     0.10738    0.11396    0.12084    0.12802    0.13552    0.14336    0.1515
100  0.15998    0.1688     0.17798    0.1875     0.19738

Precise α values for n = 25
  0  5.96E-8    1.192E-7   1.7882E-7  2.98E-7    4.172E-7   5.96E-7    8.344E-7   1.1324E-6  1.4902E-6  1.967E-6
 10  2.562E-6   3.278E-6   4.172E-6   5.246E-6   6.556E-6   8.166E-6   1.0074E-5  1.2338E-5  1.508E-5   1.8298E-5
 20  2.212E-5   2.664E-5   3.194E-5   3.814E-5   4.542E-5   5.388E-5   6.366E-5   7.498E-5   8.804E-5   1.03E-4
 30  1.2022E-4  1.399E-4   1.623E-4   1.8788E-4  2.17E-4    2.498E-4   2.87E-4    3.29E-4    3.764E-4   4.296E-4
 40  4.894E-4   5.564E-4   6.314E-4   7.15E-4    8.082E-4   9.118E-4   0.0010272  0.0011548  0.0012964  0.0014528
 50  0.0016254  0.0018156  0.002026   0.002256   0.002508   0.002784   0.003088   0.00342    0.00378    0.004176
 60  0.004604   0.005072   0.005578   0.00613    0.006726   0.00737    0.008068   0.008822   0.009636   0.01051
 70  0.011454   0.012466   0.013554   0.014722   0.015972   0.017312   0.018744   0.02028    0.0219     0.02364
 80  0.0255     0.02748    0.02958    0.0318     0.03418    0.03668    0.03934    0.04216    0.04512    0.04826
 90  0.05158    0.05508    0.05876    0.06262    0.0667     0.07098    0.07548    0.0802     0.08514    0.09032
100  0.09574    0.1014     0.10732    0.1135     0.11994    0.12664    0.13364    0.14092    0.14848    0.15634
110  0.1645     0.17296    0.18172    0.19082

Precise α values for n = 26
  0  2.98E-8    5.96E-8    8.94E-8    1.4902E-7  2.086E-7   2.98E-7    4.172E-7   5.662E-7   7.45E-7    9.834E-7
 10  1.2814E-6  1.6392E-6  2.086E-6   2.622E-6   3.278E-6   4.082E-6   5.036E-6   6.17E-6    7.54E-6    9.15E-6
 20  1.1056E-5  1.3322E-5  1.5974E-5  1.9074E-5  2.27E-5    2.694E-5   3.186E-5   3.756E-5   4.41E-5    5.164E-5
 30  6.032E-5   7.024E-5   8.156E-5   9.45E-5    1.092E-4   1.2588E-4  1.448E-4   1.6618E-4  1.9028E-4  2.174E-4
 40  2.48E-4    2.822E-4   3.208E-4   3.636E-4   4.116E-4   4.65E-4    5.246E-4   5.908E-4   6.642E-4   7.454E-4
 50  8.354E-4   9.348E-4   0.0010444  0.0011652  0.001298   0.001444   0.001604   0.0017796  0.0019716  0.002182
 60  0.00241    0.00266    0.002932   0.00323    0.00355    0.0039     0.00428    0.00469    0.005134   0.005612
 70  0.00613    0.00669    0.00729    0.007938   0.008634   0.009382   0.010186   0.011046   0.011966   0.012952
 80  0.014006   0.015132   0.016334   0.017614   0.018978   0.02042    0.02198    0.02362    0.02536    0.0272
 90  0.02916    0.03122    0.03342    0.03572    0.03816    0.04074    0.04346    0.04634    0.04934    0.05252
100  0.05586    0.05936    0.06302    0.06688    0.07092    0.07514    0.07958    0.0842     0.08902    0.09408
110  0.09934    0.10482    0.11054    0.11648    0.12266    0.1291     0.13578    0.14272    0.1499     0.15736
120  0.16508    0.17308    0.18136    0.1899     0.19874

Precise α values for n = 27
  0  1.4902E-8  2.98E-8    4.47E-8    7.45E-8    1.043E-7   1.4902E-7  2.086E-7   2.832E-7   3.726E-7   4.918E-7
 10  6.408E-7   8.196E-7   1.043E-6   1.3114E-6  1.6392E-6  2.042E-6   2.518E-6   3.084E-6   3.77E-6    4.574E-6
 20  5.528E-6   6.66E-6    7.988E-6   9.536E-6   1.1354E-5  1.347E-5   1.593E-5   1.879E-5   2.208E-5   2.586E-5
 30  3.024E-5   3.522E-5   4.094E-5   4.746E-5   5.488E-5   6.332E-5   7.29E-5    8.372E-5   9.596E-5   1.0978E-4
 40  1.2532E-4  1.4278E-4  1.624E-4   1.8434E-4  2.088E-4   2.364E-4   2.668E-4   3.008E-4   3.388E-4   3.808E-4
 50  4.272E-4   4.788E-4   5.356E-4   5.984E-4   6.678E-4   7.44E-4    8.278E-4   9.2E-4     0.001021   0.0011316
 60  0.0012526  0.001385   0.0015294  0.001687   0.0018586  0.002046   0.002248   0.002468   0.002708   0.002966
 70  0.003248   0.00355    0.003878   0.004232   0.004612   0.005024   0.005466   0.00594    0.00645    0.006998
 80  0.007586   0.008216   0.008888   0.009608   0.010378   0.0112     0.012076   0.01301    0.014006   0.015064
 90  0.01619    0.017386   0.018656   0.02       0.02142    0.02294    0.02454    0.02624    0.02802    0.0299
100  0.0319     0.034      0.0362     0.03854    0.04098    0.04356    0.04626    0.0491     0.05208    0.0552
110  0.05848    0.0619     0.06548    0.06922    0.07314    0.07722    0.08148    0.08594    0.09056    0.09538
120  0.1004     0.10562    0.11106    0.11668    0.12254    0.1286     0.13488    0.1414     0.14816    0.15514
130  0.16236    0.16982    0.17752    0.18548    0.19368

Precise α values for n = 28
  0  7.45E-9    1.4902E-8  2.236E-8   3.726E-8   5.216E-8   7.45E-8    1.043E-7   1.4156E-7  1.8626E-7  2.458E-7
 10  3.204E-7   4.098E-7   5.216E-7   6.556E-7   8.196E-7   1.0208E-6  1.2592E-6  1.5422E-6  1.885E-6   2.288E-6
 20  2.764E-6   3.33E-6    3.994E-6   4.768E-6   5.678E-6   6.736E-6   7.964E-6   9.396E-6   1.105E-5   1.295E-5
 30  1.514E-5   1.765E-5   2.052E-5   2.38E-5    2.754E-5   3.18E-5    3.664E-5   4.212E-5   4.83E-5    5.53E-5
 40  6.318E-5   7.204E-5   8.202E-5   9.32E-5    1.057E-4   1.197E-4   1.3532E-4  1.5274E-4  1.7214E-4  1.9368E-4
 50  2.176E-4   2.442E-4   2.736E-4   3.06E-4    3.418E-4   3.814E-4   4.25E-4    4.73E-4    5.256E-4   5.834E-4
 60  6.468E-4   7.162E-4   7.922E-4   8.752E-4   9.658E-4   0.0010646  0.0011722  0.0012892  0.0014166  0.0015548
 70  0.0017048  0.0018674  0.002044   0.002234   0.00244    0.002662   0.002902   0.00316    0.003438   0.003738
 80  0.00406    0.004406   0.004778   0.005176   0.005604   0.00606    0.006548   0.007072   0.00763    0.008224
 90  0.00886    0.009536   0.010256   0.011024   0.011838   0.012704   0.013624   0.014598   0.015632   0.016728
100  0.017886   0.019114   0.0204     0.02178    0.02322    0.02474    0.02636    0.02806    0.02984    0.0317
110  0.03368    0.03576    0.03792    0.04022    0.0426     0.04512    0.04774    0.0505     0.05338    0.05638
120  0.05954    0.06282    0.06624    0.06982    0.07354    0.07742    0.08146    0.08566    0.09002    0.09456
130  0.09928    0.10418    0.10926    0.11452    0.11998    0.12562    0.13146    0.13752    0.14378    0.15024
140  0.1569     0.1638     0.1709     0.17824    0.18578    0.19356

Precise α values for n = 29
  0  3.726E-9   7.45E-9    1.1176E-8  1.8626E-8  2.608E-8   3.726E-8   5.216E-8   7.078E-8   9.314E-8   1.2294E-7
 10  1.6018E-7  2.048E-7   2.608E-7   3.278E-7   4.098E-7   5.104E-7   6.296E-7   7.712E-7   9.424E-7   1.1436E-6
 20  1.382E-6   1.6652E-6  1.9968E-6  2.384E-6   2.838E-6   3.368E-6   3.982E-6   4.698E-6   5.524E-6   6.478E-6
 30  7.578E-6   8.836E-6   1.0278E-5  1.1928E-5  1.381E-5   1.5952E-5  1.8388E-5  2.114E-5   2.428E-5   2.78E-5
 40  3.18E-5    3.628E-5   4.134E-5   4.7E-5     5.336E-5   6.048E-5   6.844E-5   7.732E-5   8.72E-5    9.822E-5
 50  1.1046E-4  1.2406E-4  1.3914E-4  1.5582E-4  1.743E-4   1.9468E-4  2.172E-4   2.42E-4    2.692E-4   2.992E-4
 60  3.322E-4   3.684E-4   4.08E-4    4.514E-4   4.988E-4   5.506E-4   6.072E-4   6.688E-4   7.36E-4    8.09E-4
 70  8.884E-4   9.746E-4   0.0010684  0.0011698  0.0012798  0.0013988  0.0015274  0.0016664  0.0018164  0.0019782
 80  0.002152   0.00234    0.002542   0.00276    0.002992   0.003242   0.00351    0.003798   0.004106   0.004436
 90  0.004788   0.005164   0.005566   0.005994   0.006452   0.006938   0.007456   0.008008   0.008594   0.009216
100  0.009878   0.010578   0.011322   0.01211    0.012944   0.013826   0.014758   0.015744   0.016786   0.017884
110  0.019044   0.02026    0.02156    0.0229     0.02434    0.02584    0.0274     0.02906    0.0308     0.03262
120  0.03454    0.03654    0.03864    0.04082    0.04312    0.04552    0.04802    0.05064    0.05338    0.05622
130  0.0592     0.0623     0.06552    0.06888    0.07236    0.07598    0.07976    0.08368    0.08774    0.09196
140  0.09634    0.10086    0.10556    0.11042    0.11546    0.12066    0.12604    0.1316     0.13732    0.14326
150  0.14936    0.15566    0.16216    0.16886    0.17574    0.18284    0.19012    0.19762

Precise α values for n = 30



















1.8626E-9 


3.726E-9 


5.588E-9 


9.314E-9 


1.3038E-8 


1.8626E-8 


2.608E-8 


3.54E-8 


4.656E-8 


6.146E-8 


10 


8.01E-8 


1.0244E-7 


1.3038E-7 


1.6392E-7 


2.048E-7 


2.552E-7 


3.148E-7 


3.856E-7 


4.712E-7 


5.718E-7 


20 


6.91E-7 


8.326E-7 


9.984E-7 


1.192E-6 


1.4194E-6 


1.6838E-6 


1.9912E-6 


2.348E-6 


2.762E-6 


3.24E-6 


30 


3.79E-6 


4.422E-6 


5.144E-6 


5.974E-6 


6.918E-6 


7.994E-6 


9.22E-6 


1.061E-5 


1.2184E-5 


1.3966E-5 


40 


1.5978E-5 


1.8244E-5 


2.08E-5 


2.366E-5 


2.688E-5 


3.05E-5 


3.454E-5 


3.904E-5 


4.408E-5 


4.968E-5 


50 


5.592E-5 


6.286E-5 


7.056E-5 


7.91E-5 


8.856E-5 


9.902E-5 


1.1058E-4 


1.2334E-4 


1.374E-4 


1.5288E-4 


60 


1.699E-4 


1.886E-4 


2.092E-4 


2.316E-4 


2.562E-4 


2.832E-4 


3.128E-4 


3.45E-4 


3.8E-4 


4.184E-4 


70 


4.602E-4 


5.054E-4 


5.548E-4 


6.084E-4 


6.666E-4 


7.296E-4 


7.978E-4 


8.718E-4 


9.518E-4 


0.0010382 


80 


0.0011314 


0.0012322 


0.0013406 


0.0014574 


0.0015832 


0.0017186 


0.001864 


0.00202 


0.002188 


0.002368 


90 


0.00256 


0.002766 


0.002988 


0.003222 


0.003476 


0.003744 


0.004032 


0.004338 


0.004664 


0.005012 


100 


0.005382 


0.005776 


0.006194 


0.00664 


0.007112 


0.007612 


0.008142 


0.008706 


0.009302 


0.009932 


110 


0.010598 


0.011304 


0.012048 


0.012834 


0.013664 


0.014538 


0.01546 


0.016432 


0.017454 


0.01853 


120 


0.01966 


0.02084 


0.0221 


0.02342 


0.02478 


0.02622 


0.02774 


0.02932 


0.03098 


0.03272 


130 


0.03454 


0.03644 


0.03842 


0.04048 


0.04266 


0.0449 


0.04726 


0.04972 


0.05226 


0.05492 


140 


0.05768 


0.06056 


0.06356 


0.06666 


0.0699 


0.07324 


0.07672 


0.08032 


0.08406 


0.08794 


150 


0.09194 


0.0961 


0.1004 


0.10484 


0.10944 


0.11418 


0.11908 


0.12414 


0.12936 


0.13474 


160 


0.14028 


0.146 


0.15188 


0.15794 


0.16418 


0.1706 


0.1772 


0.18396 


0.19092 


0.19808 



Table 28.18: Table with precise α-values for n ∈ 4..30 (from Darlington [484]).

Further information and tables for Wilcoxon rank distributions can be found in [1290, 
1436, 2221] and [2222]. [1076] 

Mann-Whitney U Test

Wilcoxon's signed rank test [2220] has the drawback that the data samples must be arranged in pairs $(a_i, b_i)$ and thus, the number of the $a_i$ has to be the same as the number of the $b_i$. In many practical experiments, this is not the case and Wilcoxon's test is not applicable. Assume we have run two experimental series with different EAs applied to the same problem for some time (where each run in a series has the same configuration). Then, there is no group relation between the measurements $a_i$ and $b_i$, and creating tuples as needed for the signed rank test is, basically, nonsense. Furthermore, it could be possible that the runs in the first series of tests usually finished faster than those in the second series. Then, we would have more samples $a_i$ than $b_i$. In order to apply the signed rank test properly, we need equally many samples from both configurations and would therefore have to discard some samples, which may lead to a loss of important information.

The U test 78 developed by Mann and Whitney [1356] circumvents these problems [1878, 573, 1520]. It assesses whether two samples are from the same distribution ($H_0$) or not ($H_1$). Different from the signed rank test, it does not require the samples to be paired nor to contain the same number of elements.

Basically, the U test is carried out almost exactly like the signed rank test. We will illustrate this using the example for unpaired samples given in Table 28.15 where the set $a$ with median 3 has $n_a = 6$ samples and $b$ consists of $n_b = 8$ elements with a median of 5.5.

1. The elements $a_i$ and $b_i$ are mixed together and sorted.

2. Each element now receives a rank corresponding to its position in the list. Like in the sign test, elements which have the same value receive the same rank (see point 4 in Section 28.8.1). The elements of the first two rows in Table 28.15, for instance, both receive rank $0.5(1 + 2) = 1.5$ whereas those in rows 7 to 10 have the rank $0.5(7 + 10) = 8.5$.

3. The rank sums $R_a = \sum r_a = 28$ and $R_b = \sum r_b = 77$ are determined, where $R_a + R_b = 105 = \frac{n(n+1)}{2}$ (with $n = n_a + n_b$) always holds.

4. The sample statistics are then given as $U_a = R_a - \frac{n_a(n_a+1)}{2} = 28 - 21 = 7$ and $U_b = R_b - \frac{n_b(n_b+1)}{2} = 77 - 36 = 41$ (where $U_a + U_b = n_a n_b$ always holds).

5. The smaller of the two values $U = \min\{U_a, U_b\} = 7$ is used.

6. For the significance level $\alpha$, the critical $U$ values can be computed for the two-sided test as

$$U_\alpha = \frac{n_a n_b}{2} - z\left(1 - \frac{\alpha}{2}\right)\sqrt{\frac{n_a n_b (n_a + n_b + 1)}{12}}$$ (28.293)

where $z$ is the probit function, the inverse cumulative distribution function of the standard normal distribution (see Definition 28.49). The values of $z$ can be looked up in Table 28.7. For $\alpha = 0.05$ we get $z(1 - \frac{\alpha}{2}) = z(0.975) \approx 1.96$ and for $\alpha = 0.01$, we find $z(1 - \frac{\alpha}{2}) = z(0.995) \approx 2.575$. Hence, $U_{0.05} \approx 24 - 1.96\sqrt{60} \approx 8.82$ and $U_{0.01} \approx 24 - 2.575\sqrt{60} \approx 4.05$.

7. We compare $U$ with $U_\alpha$ and can discard the null hypothesis $H_0$ if and only if $U$ is smaller.

In the example, $U < U_{0.05}$ holds whereas $U < U_{0.01}$ does not. In other words, with 5% chance of error, we can declare the two samples $a$ and $b$ to be different and can assume that the median $\mathrm{med}(a) = 3$ is significantly smaller than $\mathrm{med}(b) = 5.5$. If we wish for no more than 1% chance of error, however, the differences between $a$ and $b$ are not significant (enough). For $\alpha = 0.05$, the critical values $U_{0.05}$, i.e., the highest allowed $U$ values for which the null hypothesis $H_0$ can be rejected, are listed in Table 28.19. For $n_a = 6$ and $n_b = 8$, we find $U_{0.05} \approx 8$ which fits to the value used in point 6 of the example application of the test. The U test can be computed with the nice online utility provided by Lowry [1312].

78 http://en.wikipedia.org/wiki/Mann-Whitney_U [accessed 2008-10-24]
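The seven steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `MannWhitneySketch` and its methods are hypothetical, the sample values in `main` are made up (they are not the data of Table 28.15), and the value of $z(1 - \frac{\alpha}{2})$ is passed in by the caller instead of being computed.

```java
import java.util.Arrays;

/** Sketch of the U test computation for two unpaired samples (hypothetical helper class). */
class MannWhitneySketch {

  /** Returns {U_a, U_b}; tied values share the mean of their positions. */
  static double[] uStatistics(double[] a, double[] b) {
    int na = a.length, nb = b.length, n = na + nb;
    double[] all = new double[n];
    System.arraycopy(a, 0, all, 0, na);
    System.arraycopy(b, 0, all, na, nb);
    Arrays.sort(all); // step 1: mix and sort

    double ra = 0; // step 3: rank sum of sample a
    for (double x : a) {
      int first = 0;
      while (all[first] < x) first++; // first position of x (0-based)
      int last = first;
      while (last + 1 < n && all[last + 1] == x) last++; // last position of x
      ra += 0.5 * ((first + 1) + (last + 1)); // step 2: mean of the 1-based positions
    }
    double rb = 0.5 * n * (n + 1) - ra; // uses R_a + R_b = n(n+1)/2

    // step 4: the two sample statistics
    double ua = ra - 0.5 * na * (na + 1);
    double ub = rb - 0.5 * nb * (nb + 1);
    return new double[] { ua, ub };
  }

  /** Critical value of Equation 28.293; z = z(1 - alpha/2) is looked up by the caller. */
  static double criticalU(int na, int nb, double z) {
    return 0.5 * na * nb - z * Math.sqrt(na * nb * (na + nb + 1) / 12.0);
  }

  public static void main(String[] args) {
    // made-up samples, used only to exercise the code
    double[] u = uStatistics(new double[] { 1, 2, 3 }, new double[] { 2, 4, 5, 6 });
    System.out.println("U = " + Math.min(u[0], u[1])); // step 5
    System.out.println("U_0.05 for n_a = 6, n_b = 8: " + criticalU(6, 8, 1.96)); // step 6
  }
}
```

The search for the first and last position of each value is quadratic in the worst case, which seems acceptable for the small sample sizes typical of such experiments.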
Table 28.19: The critical values $U_{0.05}$ for the two-sided Mann-Whitney U test [2219].



Fisher's Exact Test 

Fisher's exact test 79 [678, 680, 15, 252, 1310] tests whether two samples with binary data are independent or not. The null hypothesis $H_0$ is that the values in both samples follow the same distribution. If $H_0$ does not hold, the two samples differ in the probabilities with which the two possible binary values occur in them and hence, the distribution of these values depends on the samples.

For illustration purposes let us go back to the (unpaired) example data sets given in Table 28.15 on page 510. Assume that the columns $a$ and $b$ would represent different configurations of an optimizer applied to a single-objective optimization problem. Then, in each cell, the objective value (subject to minimization) of the best solution candidate found in the corresponding run is noted. Assume we can consider an experiment as successful if this value is below 4 and that success ($s$) or failure ($\bar{s}$) was the binary criterion which we want to investigate.

79 http://en.wikipedia.org/wiki/Fisher%27s_exact_test [accessed 2008-12-08]

Series $a$ has four successful runs ($a_s = 4$) and two failed ones ($a_{\bar{s}} = 2$) whereas series $b$ succeeded only once ($b_s = 1$) while seven runs could not create a solution candidate with an objective value below 4 ($b_{\bar{s}} = 7$). We have sketched this scenario (let us call it $C_5$) in Table 28.20. Series $a$ seems to be more successful than series $b$ in this example. The question is whether this is indeed statistically significant or whether it might have been a fluke as well (and the null hypothesis $H_0$ is more likely to hold).




Table 28.20: A 2 × 2 contingency table based on Table 28.15.



Assume that the distribution of $s$ and $\bar{s}$ was the same in the stochastic processes which have been sampled (as $a$ and $b$), i.e., that the null hypothesis $H_0$ holds. Then, the probability that any configuration $C = (a_s, a_{\bar{s}}, b_s, b_{\bar{s}})$ would have resulted is given in Equation 28.294, where $N = a_s + a_{\bar{s}} + b_s + b_{\bar{s}}$ is the total number of runs. In Equation 28.295, we apply this equation to the scenario $C_5$ shown in Table 28.20 and obtain a probability of roughly 6% for it.

$$P(C) = \frac{(a_s + b_s)!\,(a_{\bar{s}} + b_{\bar{s}})!\,(a_s + a_{\bar{s}})!\,(b_s + b_{\bar{s}})!}{N!\,a_s!\,a_{\bar{s}}!\,b_s!\,b_{\bar{s}}!}$$ (28.294)

$$P(C_5) = \frac{5!\,9!\,6!\,8!}{14!\,4!\,2!\,1!\,7!} = \frac{1\,264\,146\,186\,240\,000}{21\,090\,172\,207\,104\,000} = \frac{60}{1001} \approx 0.0599401$$ (28.295)

In order to test whether we can reject the null hypothesis, we need to compute the total probability that a configuration at least as extreme as $C_5$ with the same $a/b$ and $s/\bar{s}$ division (the $\Sigma$ row and column) can occur. In Table 28.21 we list the scenarios with the same marginal distributions and their corresponding probabilities (which sum up to one) in the second-to-last row. What remains to do is to find out which of these scenarios are at least as extreme as $C_5$. For each scenario $C$, we compute the disproportion $D(C)$ - the degree of ostensible dependency between the two samples [1310] - according to Equation 28.296. For $C_5$, we have computed this value in Equation 28.297.

           $C_1$               $C_2$               $C_3$               $C_4$               $C_5$               $C_6$
$P(C_i)$   ≈ 0.0280            ≈ 0.2098            ≈ 0.4196            ≈ 0.2797            ≈ 0.0599            ≈ 0.0030
$D(C_i)$   $0.\overline{6}$    $0.3\overline{5}$   $0.0\overline{4}$   $0.2\overline{6}$   $0.5\overline{7}$   $0.\overline{8}$

Table 28.21: All configurations with the same total sums in $a/b$ and $s/\bar{s}$ as in Table 28.20.
$$D(C) = \left| \frac{a_s}{a_s + b_s} - \frac{a_{\bar{s}}}{a_{\bar{s}} + b_{\bar{s}}} \right|$$ (28.296)

$$D(C_5) = \left| \frac{4}{4 + 1} - \frac{2}{2 + 7} \right| = \frac{26}{45} = 0.5\overline{7} \approx 0.578$$ (28.297)



In the bottom row of Table 28.21, we have listed the disproportion values $D$ for all six possible scenarios. Amongst these, only the scenarios $C_6$ and $C_1$ (and $C_5$ itself) have a $D$-value at least as big as that of $C_5$, so we can compute the probability $p$ with which an experimental outcome at least as extreme as the observed $C_5$ would occur under $H_0$ as $p = P(C_1) + P(C_5) + P(C_6) = 0.\overline{09} \approx 0.09$. Hence, under a significance level of $\alpha = 10\%$, we could reject the null hypothesis $H_0$ and assume that there is a difference between the samples $a$ and $b$. If we want a significance level of $\alpha = 5\%$, we cannot reject $H_0$ and must consider the experimental outcome as a fluke.

Computing Fisher's exact test by hand may become time consuming. The online utilities 
provided by Lowry [1311] and Langsrud [1246] (which sometimes produce slightly different 
results) provide a handy alternative. 
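For readers who prefer to compute the test locally, the procedure described above can be sketched in Java. This is a minimal sketch, not a validated statistics routine: the class `FisherExactSketch` and its method names are hypothetical, the suffix `f` stands for failure ($\bar{s}$), and the code assumes that both column sums are non-zero.

```java
/** Sketch of Fisher's exact test for a 2 x 2 contingency table (hypothetical helper class). */
class FisherExactSketch {

  /** ln(n!) as a plain sum; adequate for the small counts used here. */
  static double lnFactorial(int n) {
    double s = 0;
    for (int i = 2; i <= n; i++) s += Math.log(i);
    return s;
  }

  /** Probability of the configuration C = (as, af, bs, bf) under H_0, see Equation 28.294. */
  static double probability(int as, int af, int bs, int bf) {
    double ln = lnFactorial(as + bs) + lnFactorial(af + bf)
              + lnFactorial(as + af) + lnFactorial(bs + bf)
              - lnFactorial(as + af + bs + bf)
              - lnFactorial(as) - lnFactorial(af) - lnFactorial(bs) - lnFactorial(bf);
    return Math.exp(ln);
  }

  /** Disproportion D(C), see Equation 28.296; assumes both column sums are non-zero. */
  static double disproportion(int as, int af, int bs, int bf) {
    return Math.abs(as / (double) (as + bs) - af / (double) (af + bf));
  }

  /** Sums P(C) over all tables with the same margins whose D is at least the observed one. */
  static double pValue(int as, int af, int bs, int bf) {
    int rowA = as + af, rowB = bs + bf, colS = as + bs;
    double observed = disproportion(as, af, bs, bf);
    double p = 0;
    for (int k = Math.max(0, colS - rowB); k <= Math.min(rowA, colS); k++) {
      int kaf = rowA - k, kbs = colS - k, kbf = rowB - kbs;
      // small tolerance so that the observed table itself is always included
      if (disproportion(k, kaf, kbs, kbf) >= observed - 1e-12) {
        p += probability(k, kaf, kbs, kbf);
      }
    }
    return p;
  }

  public static void main(String[] args) {
    System.out.println(probability(4, 2, 1, 7)); // scenario C_5
    System.out.println(pValue(4, 2, 1, 7));      // probability of an outcome at least as extreme
  }
}
```

Working with logarithms of factorials avoids overflow for larger counts. Other tools may define "at least as extreme" differently, which is one possible reason for slightly different results.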



28.9 Generating Random Numbers 

Definition 28.67 (Random Number). Random numbers are the values taken on by a random variable. A random number generator 80 produces a sequence $r = (r_1, r_2, \ldots)$ of random numbers $r_i$ as result of independent repetitions of the same random experiment.

Since the numbers $r_i$ are all produced by the same random experiment, they approximate a certain random distribution according to the law of large numbers (see Section 28.3.8 on page 478).

For true random number generators, there exists no function or algorithm $f(i) = r_i$ or $f(r_{i-n+1}, r_{i-n+2}, \ldots, r_i) = r_{i+1}$ that can produce this sequence in a deterministic manner with or without knowledge of the random numbers previously returned from the generator. Such behavior can be achieved by obtaining the numbers $r_i$ from measurements of a physical process, for instance. Today, there exist many such so-called hardware random number generators 81 [603, 2126, 676, 1981].

Of course, most computers are not equipped with special hardware for random number 
production, although some standard devices can be utilized for that purpose. One could, for 
example, measure the white noise of soundcards or the delays between the user's keystrokes. 
Such methods have the drawback that they require the presence of and access to such 
components. Furthermore, the speed of them is limited since you cannot produce random 
numbers faster than the recording speed of the soundcard or faster than the user is typing. 

28.9.1 Generating Pseudorandom Numbers 

In security-sensitive areas like cryptography, we need true random numbers [2125, 1981, 
1373]. For normal PC applications and most scientific purposes, pseudorandom number 
generators 82 are sufficient. 

The principle of pseudorandom number generators is to produce a sequence of numbers $r = (r_1, r_2, \ldots)$, $r_i \in R\ \forall i \in \mathbb{N}$, $R \subseteq \mathbb{R}$ which are not obviously interdependent, i.e., when knowing a number $r_i$ there is no simple way to find out the value of $r_{i+1}$.

Of course, since the values are no real random numbers, there is an algorithm or function $f : V \mapsto R \times V$ where $R$ is the set of possible numbers and $V$ is the space of some internal variables. These internal variables are referred to as seed and normally change whenever a new number is produced. Often, the seed is initialized with either a true random number or the current system time. In the first case, it is also practicable to re-initialize the seed from time to time with new true random values.

80 http://en.wikipedia.org/wiki/Random_number_generator [accessed 2007-07-03], http://en.wikipedia.org/wiki/Random_number_generation [accessed 2007-07-03]

81 http://en.wikipedia.org/wiki/Hardware_random_number_generator [accessed 2007-07-03]

82 http://en.wikipedia.org/wiki/Pseudorandom_number_generator [accessed 2007-07-03], http://en.wikipedia.org/wiki/Pseudorandomness [accessed 2007-07-03]

Pseudorandom numbers are attractive for all applications that are not security-critical but need some sort of unpredictable behavior. They are often used in games or simulations, since they usually can be generated much quicker than true random numbers. On the other hand, especially in scientific applications the "degree" of randomness is very important. There are many incidents, for example in physical simulation, where the inappropriate use of pseudorandom number generators of poor quality led to wrong conclusions [2112, 1852, 662]. It should be noted that there also exist cryptographically secure pseudorandom number generators 83 which create pseudorandom numbers that are of especially high quality.

There exists a variety of algorithms that generate pseudorandom numbers [2008, 1611, 
234, 348] and many implementations for different programming languages and architectures 
[918, 1594, 421]. It is even possible to evolve pseudorandom number generators using Genetic 
Programming, as shown, for instance, by Koza [1192]. 

Linear Congruential Generator (LCG) 

The linear congruential generator 84 (LCG) was first proposed by Lehmer [1272] and is one of the most frequently used and simplest pseudorandom number generators [1271]. It updates an internal integer number $v \in V = (0 \ldots (m - 1))$, $m \in \mathbb{N}$ in each step according to Equation 28.298. The modulus $m$ is a natural number which defines the maximum number of values $v$ can take on. $a$ and $b$ are both constants. Therefore, $v$ will periodically take on the same values - at most after $m$ steps. The pseudorandom numbers $r_i$ are approximately uniformly distributed in the interval $[0, m)$ (see Section 28.4.1) and can be computed as proposed in Equation 28.299.



Whether the full period can really be reached depends a lot on the values of the parameters $a$, $b$, and $m$. There are many constellations known where only a small fraction of the period $m$ is utilized [634]. In order to produce the full period, the following requirements should be met according to Wikipedia [2219].

1. $b$ and $m$ are relatively prime

2. $a - 1$ is divisible by all prime factors of $m$

3. $a - 1$ is a multiple of 4 if $m$ is a multiple of 4

4. $m > \max\{a, b, v_0\}$

5. $a > 0$, $b > 0$

Good standard values for the constants are $a = 1\,664\,525$, $b = 1\,013\,904\,223$, and $m = 2^{32}$. One of the most widespread realizations of LCGs has been outlined by Knuth [1161]. In Java, the class java.util.Random uses this approach with the settings $a = 25\,214\,903\,917$, $b = 11$, and $m = 2^{48}$.
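The update rule of Equation 28.298 and the effect of the requirements above can be tried out with a small Java sketch. The class `LcgSketch` and its `period` helper are hypothetical names introduced for illustration. The code assumes a non-negative seed and that $a \cdot m$ fits into a `long`, which holds for the standard values $a = 1\,664\,525$ and $m = 2^{32}$.

```java
/** Sketch of a linear congruential generator (hypothetical helper class). */
class LcgSketch {
  private final long a, b, m;
  private long v;

  LcgSketch(long a, long b, long m, long seed) {
    this.a = a;
    this.b = b;
    this.m = m;
    this.v = seed % m; // assumes seed >= 0
  }

  /** One update step v_i = (a * v_{i-1} + b) mod m; assumes a * m fits into a long. */
  long next() {
    v = (a * v + b) % m;
    return v;
  }

  /** Counts the steps until the seed value recurs; returns -1 if it does not recur within m steps. */
  static long period(long a, long b, long m, long seed) {
    LcgSketch g = new LcgSketch(a, b, m, seed);
    for (long steps = 1; steps <= m; steps++) {
      if (g.next() == seed % m) return steps;
    }
    return -1;
  }

  public static void main(String[] args) {
    LcgSketch g = new LcgSketch(1664525L, 1013904223L, 1L << 32, 0L);
    System.out.println(g.next());
    System.out.println(g.next());
    // a = 5, b = 3, m = 16 meet requirements 1 to 3: the full period m is reached
    System.out.println(period(5, 3, 16, 0));
    // a = 3 violates requirement 3 (a - 1 = 2 is no multiple of 4): the period is shorter
    System.out.println(period(3, 3, 16, 0));
  }
}
```

The tiny modulus $m = 16$ is chosen only so that the period can be counted by brute force; it is not a sensible setting for a real generator.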

83 http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator [accessed 2007-07-03]

84 http://en.wikipedia.org/wiki/Linear_congruential_generator [accessed 2007-07-03]



$$v_i = (a\,v_{i-1} + b) \bmod m$$ (28.298)

$$r_i = v_i$$ (28.299)
28.9.2 Random Functions

Definition 28.68 (Random Function). In the context of this book, we define a random function random as a construct that eases the utilization of random numbers and random variables. It represents access to a random process, an infinite sequence of random variables $X_i$ all distributed according to the same distribution function. Starting with $X_1$, each time a random function is evaluated, it returns the value of the next random variable in the sequence $i = 1, 2, 3, \ldots$

Definition 28.69 (Uniformly Distributed Random Number Generator). We define the function $\mathrm{random}_u(\check{r}, \hat{r})$ to draw uniformly distributed (see Section 28.5.1 on page 485) random numbers from the interval with the boundaries $\check{r}$ (inclusively) and $\hat{r}$ (exclusively). The parameter-less function $\mathrm{random}_u()$ will return a uniformly distributed number from the interval spanning from 0 inclusively to 1 exclusively.

$$\mathrm{random}_u(\check{r}, \hat{r}) \in [\check{r}, \hat{r}) \subseteq \mathbb{R},\quad \check{r}, \hat{r} \in \mathbb{R},\quad \check{r} < \hat{r}$$ (28.300)

$$\mathrm{random}_u() = \mathrm{random}_u(0, 1)$$ (28.301)

The $\mathrm{random}_u()$ function can be realized with the linear congruential pseudorandom number generators that we have just discussed in Section 28.9.1, for example.

Definition 28.70 (Normally Distributed Random Number Generator). We define the function $\mathrm{random}_n(\mu, \sigma^2)$ to generate normally distributed (see Section 28.5.2 on page 486) random numbers with the expected value $\mu$ and the variance $\sigma^2$. The parameter-less function $\mathrm{random}_n()$ will return a standard normally distributed number (with $\mu = 0$ and $\sigma^2 = 1$).

$$\mathrm{random}_n(\mu, \sigma^2) \sim N(\mu, \sigma^2)$$ (28.302)

$$\mathrm{random}_n() = \mathrm{random}_n(0, 1)$$ (28.303)

Cut-off Random Functions 

We often use random processes and random functions to model or simulate certain features of a real system. If we, for example, simulate a chicken farm, we might be interested in the weight of the eggs laid by the hens. We can assume this weight to be normally distributed 85 around some mean $\mu$ with a variance $\sigma^2 \neq 0$. In the simulation, a series of egg weights is created simply by drawing subsequent random numbers, i.e., by calling $\mathrm{random}_n(\mu, \sigma^2)$ repeatedly. Although the normal distribution is a good model for egg weights, it has a serious drawback: no matter how we choose $\mu$ or $\sigma$, there is still a positive probability of drawing zero, negative, or extremely large (> 10t) weights. In reality, however, such things have not yet been documented.

What we need here is a cut-off mechanism for our random function $\mathrm{random}_n(\mu, \sigma^2)$ that still preserves as many of its properties as possible. Given any random function random, the function $\mathrm{random}_l(\mathrm{random}, low, high)$, defined as Algorithm 28.2, ensures that $low \leq \mathrm{random}_l(\mathrm{random}, low, high) < high$.

28.9.3 Converting Random Numbers to other Distributions 

There are occasions where random numbers of a different distribution than available are 
needed. We could, for example, have a linear congruential generator for uniformly distributed 
random numbers like elaborated in Section 28.9.1 but may need normally distributed values. 



85 see Section 28.5.2 on page 486
Algorithm 28.2: $r \leftarrow \mathrm{random}_l(\mathrm{random}, low, high)$

Input: random: a random function (maybe with further implicit parameters)
Input: $low \in \mathbb{R}$: the inclusive, lower bound of the random result
Input: $high \in \mathbb{R}$, $low < high$: the exclusive, upper bound of the random result
Data: $r$: the intermediate random value
Output: $r$: a value returned by random with $low \leq r < high$

begin
  repeat
    r ← random()
  until (r ≥ low) ∧ (r < high)
  return r
end
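Algorithm 28.2 translates almost literally into Java. The following is a minimal sketch: the class `CutOffRandomSketch`, the method name `randomL`, and the egg weight parameters (mean 60, standard deviation 10, bounds 40 and 80) are assumptions made for illustration and do not come from the text.

```java
import java.util.Random;
import java.util.function.DoubleSupplier;

/** Sketch of the cut-off random function of Algorithm 28.2 (hypothetical helper class). */
class CutOffRandomSketch {

  /** Redraws from the given random function until low <= r < high holds. */
  static double randomL(DoubleSupplier random, double low, double high) {
    double r;
    do {
      r = random.getAsDouble();
    } while (!(r >= low && r < high));
    return r;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42L);
    // hypothetical egg weights: mean 60, standard deviation 10
    DoubleSupplier eggWeight = () -> 60.0 + 10.0 * rnd.nextGaussian();
    System.out.println(randomL(eggWeight, 40.0, 80.0));
  }
}
```

Note that the loop only terminates quickly if the interval $[low, high)$ carries a reasonable share of the probability mass of the underlying random function.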



Uniform Distribution → Uniform Distribution

If we have random numbers $r_i$ distributed uniformly in the interval $[a_1, b_1)$ and need random numbers $s_i$ uniformly distributed in the interval $[a_2, b_2)$, they can be converted by a simple linear mapping according to

$$s_i = a_2 + (b_2 - a_2)\,\frac{r_i - a_1}{b_1 - a_1}$$ (28.304)




Uniform Distribution → Normal Distribution

In order to transform random numbers which are uniformly distributed in the interval $[0, 1)$ to standard-normally distributed random numbers ($\mu = 0$, $\sigma^2 = 1$), we can apply the Box-Muller 86 transformation [262]. This approach creates two standard-normally distributed random numbers $n_1, n_2$ from two random numbers $r_1, r_2$ which are uniformly distributed in $[0, 1)$ at once according to Equation 28.305. In both formulas, the terms $\sqrt{-2 \ln r_1}$ and $2\pi r_2$ are used. The performance can be increased if both terms are computed only once and reused.

$$n_1 = \sqrt{-2 \ln r_1}\,\cos(2\pi r_2)$$
$$n_2 = \sqrt{-2 \ln r_1}\,\sin(2\pi r_2)$$ (28.305)
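Equation 28.305 can be sketched in Java as shown below, with the two shared terms computed only once. The class name `BoxMullerSketch` is hypothetical. Since $\ln 0$ is undefined, the sketch assumes $r_1 > 0$; in `main`, this is ensured by using `1.0 - nextDouble()`, which lies in $(0, 1]$.

```java
import java.util.Random;

/** Sketch of the basic Box-Muller transformation (hypothetical helper class). */
class BoxMullerSketch {

  /** Maps r1 in (0, 1] and r2 in [0, 1) to two standard-normally distributed values. */
  static double[] transform(double r1, double r2) {
    double radius = Math.sqrt(-2.0 * Math.log(r1)); // computed once, used twice
    double angle = 2.0 * Math.PI * r2;              // computed once, used twice
    return new double[] { radius * Math.cos(angle), radius * Math.sin(angle) };
  }

  public static void main(String[] args) {
    Random rnd = new Random();
    double[] n = transform(1.0 - rnd.nextDouble(), rnd.nextDouble());
    System.out.println(n[0] + " " + n[1]);
  }
}
```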

The polar form of this method, illustrated as Algorithm 28.3, is not only faster, but also numerically more robust [556]. It creates two independent random numbers uniformly distributed in $[-1, 1)$ and computes the sum of their squares $w$. This is repeated until $w \in (0, 1)$. With this value, we now can compute two independent, standard-normally distributed random numbers. Effectively, we have traded a trigonometric operation and a multiplication against a division compared to the original method in Equation 28.305. The implementation of this algorithm is discussed in [1161] which is the foundation of the method nextGaussian of the Java class java.util.Random.

Normal Distribution → Normal Distribution

With Equation 28.306, a normally distributed random number $n_1 \sim N(\mu_1, \sigma_1^2)$ can be transformed to another normally distributed random number $n_2 \sim N(\mu_2, \sigma_2^2)$.

$$n_2 = \mu_2 + \sigma_2\,\frac{n_1 - \mu_1}{\sigma_1}$$ (28.306)



86 http://en.wikipedia.org/wiki/Box_muller [accessed 2007-07-03]






Algorithm 28.3: $(n_1, n_2) \leftarrow \mathrm{random}_{n,p2}()$

Data: $n_1, n_2$: the intermediate and result variables
Data: $w$: the polar radius
Output: $(n_1, n_2)$: a tuple of two independent values $n_1 \sim N(0, 1)$, $n_2 \sim N(0, 1)$

begin
  repeat
    n_1 ← random_u(-1, 1)
    n_2 ← random_u(-1, 1)
    w ← (n_1 * n_1) + (n_2 * n_2)
  until (w > 0) ∧ (w < 1)
  w ← sqrt((-2 ln w) / w)
  return (n_1 * w, n_2 * w)
end
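A direct Java rendering of Algorithm 28.3 might look as follows. This is a sketch with the hypothetical class name `PolarMethodSketch`; $\mathrm{random}_u(-1, 1)$ is realized here as `2 * nextDouble() - 1`, which is an assumption about how the uniform source is provided.

```java
import java.util.Random;

/** Sketch of the polar form of the Box-Muller transformation (hypothetical helper class). */
class PolarMethodSketch {

  /** Returns two independent, standard-normally distributed values. */
  static double[] randomNormalPair(Random rnd) {
    double n1, n2, w;
    do {
      n1 = 2.0 * rnd.nextDouble() - 1.0; // uniform in [-1, 1)
      n2 = 2.0 * rnd.nextDouble() - 1.0; // uniform in [-1, 1)
      w = n1 * n1 + n2 * n2;             // squared polar radius
    } while (w <= 0.0 || w >= 1.0);
    w = Math.sqrt(-2.0 * Math.log(w) / w);
    return new double[] { n1 * w, n2 * w };
  }

  public static void main(String[] args) {
    double[] n = randomNormalPair(new Random());
    System.out.println(n[0] + " " + n[1]);
  }
}
```

On average, roughly 21% of the candidate points fall outside the unit circle and are rejected, so the loop runs about 1.27 times per pair.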



Uniform Distribution → Exponential Distribution

With Equation 28.307, a random number $r$ uniformly distributed in the interval $(0, 1)$ (0 is excluded) can be transformed into an exponentially distributed random number $s \sim \exp(\lambda)$.

$$s = \frac{-\ln r}{\lambda}$$ (28.307)

Exponential Distribution → Exponential Distribution

With Equation 28.308, an exponentially distributed random number $r_1 \sim \exp(\lambda_1)$ can be transformed to an exponentially distributed number $r_2 \sim \exp(\lambda_2)$.

$$r_2 = \frac{\lambda_1 r_1}{\lambda_2}$$ (28.308)
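Both exponential conversions are one-liners in code. The sketch below uses the hypothetical class `ExponentialSketch`; in `main`, the uniform input is taken as `1.0 - nextDouble()` so that 0 is excluded, as Equation 28.307 requires.

```java
import java.util.Random;

/** Sketch of the two conversions to exponentially distributed numbers (hypothetical helper class). */
class ExponentialSketch {

  /** Equation 28.307: maps r, uniform in (0, 1], to an exp(lambda)-distributed number. */
  static double fromUniform(double r, double lambda) {
    return -Math.log(r) / lambda;
  }

  /** Equation 28.308: maps r1 ~ exp(lambda1) to a number distributed according to exp(lambda2). */
  static double rescale(double r1, double lambda1, double lambda2) {
    return lambda1 * r1 / lambda2;
  }

  public static void main(String[] args) {
    Random rnd = new Random();
    double s = fromUniform(1.0 - rnd.nextDouble(), 2.0);
    System.out.println(s + " -> " + rescale(s, 2.0, 4.0));
  }
}
```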



Uniform Distribution → Bell-shaped Distribution

The basis of many numerical optimization algorithms is the modification of a value $x$ by adding some random number to it. If the probability density function of the underlying distribution producing this number is symmetrically bell-shaped, the result of the additive modification will be smaller or larger than $x$ with the same probability. Results which are close to $x$ will be more likely than those that are very distant. One example for such a distribution is the normal distribution. Another example is the bell-shaped random number generator used by Worakul et al. [2255, 2256], defined here as Algorithm 28.4. It is algorithmically close to the polar form of the Box-Muller transform for the normal distribution (see Algorithm 28.3) but differs in the way the internal variable $w$ is created. The function $\mathrm{random}_{bs}(\mu, \sigma)$ creates a new random number according to this distribution, with an expected value $\mu$ and the standard deviation $\sigma$.

You may have wondered about the factor 0.5513 in the algorithm. This number "normalizes" the standard deviation of the bell-shaped distribution, since $D^2[r(y)] = D^2\left[\ln\frac{y}{1-y}\right] \neq 1$.

We can show this by first determining the cumulative distribution function $F_X(x)$ for $r(y)$ in Equation 28.311 and then differentiating in order to obtain the probability density function $f_X(x)$ in Equation 28.313.
Algorithm 28.4: $y \leftarrow \mathrm{random}_{bs}(\mu, \sigma)$

Input: $\mu$: the mean value of the bell-shaped distribution
Input: $\sigma$: the approximate standard deviation of the bell-shaped distribution
Data: $w$: a uniformly distributed random number $w \in (0, 1)$
Output: $y$: a bell-shaped distributed random number

begin
  repeat
    w ← random_u()
  until (w > 0) ∧ (w < 1)
  y ← μ + σ * 0.5513 * ln(w / (1 - w))
  return y
end



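$$F_X(x) = P(X \leq x)$$ (28.309)

$$X = r(y) = \ln\left(\frac{y}{1 - y}\right)$$ (28.310)

$$F_X(x) = y = \frac{e^x}{1 + e^x}$$ (28.311)

$$f_X(x) = \frac{d}{dx} F_X(x) = \frac{d}{dx}\left(\frac{e^x}{1 + e^x}\right) = \frac{e^x (1 + e^x) - e^x (e^x)}{(1 + e^x)^2}$$ (28.312)

$$f_X(x) = \frac{e^x}{(1 + e^x)^2}$$ (28.313)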



Unfortunately, here it stops. We can apply neither Equation 28.56 on page 473 nor Equation 28.63 on page 474 in order to determine the expected value or the variance, since both will result in integrals that the author 87 cannot compute. However, it is easy to see that $EX = 0$, since $r(y)$ is point symmetric around 0.5. The value $D^2X \approx 3.28984$ I can only determine numerically with the small Java program in Listing 28.1, which is based on the idea that we can assume the uniform random numbers to be uniformly distributed in $(0, 1)$ (of course). Hence we can simulate a "complete sample" by iterating over $i = 1$ to $T - 1$ and taking $i/T$ as input for $r(y)$. Since we step over all $i$ from 1 to $T - 1$, this resembles a uniform distribution and also leaves away the special cases $y = 0$ ($i = 0$) and $y = 1$ ($i = T$). Furthermore, we can skip half of the steps since our distribution is symmetric. Well, $EX = 0$ if $\mu = 0$ and therefore we can simplify $D^2X = E[X^2] - (EX)^2$ (see Equation 28.61 on page 474) to $D^2X = E[X^2]$.

This method is, of course, very crude and subject to numerical errors in the floating point computations. However, with $D^2X \approx 3.28984$ and $DX = \sqrt{D^2X} \approx 1.8138$ we know that we have to scale $r(y)$ by $\frac{1}{DX} \approx 0.5513$ (see Equation 28.65 on page 474) so that the standard deviation of the bell-shaped distribution $\mathrm{random}_{bs}(\mu, \sigma)$ will become $D[\mathrm{random}_{bs}(\mu, \sigma)] \approx \sigma$.



Yes. I suck in maths. 
    long i, max;
    double sum2, v;

    max = 10000000;
    sum2 = 0;
    v = 0;

    // distribution is symmetric -> iterate one wing
    for (i = (max >> 1); i < max; i++) {
      v = Math.log(((double) (max - i)) / ((double) i));
      sum2 += (v * v); // sum up the squares of the single terms
    }

    System.out.print(sum2 / (max - (max >> 1)));

Listing 28.1: Approximating $D^2X$ of $r(y)$.
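As a complement to Listing 28.1, Algorithm 28.4 itself can be sketched in Java and its standard deviation checked empirically. The class `BellShapedSketch` and the method names are hypothetical, and the orientation $\ln\frac{w}{1-w}$ of the logarithm is an assumption taken from Equation 28.310 (the mirrored form $\ln\frac{1-w}{w}$ yields the same distribution because of the symmetry).

```java
import java.util.Random;

/** Sketch of the bell-shaped random number generator of Algorithm 28.4 (hypothetical helper class). */
class BellShapedSketch {

  /** Returns a bell-shaped distributed number with mean mu and standard deviation of about sigma. */
  static double randomBs(Random rnd, double mu, double sigma) {
    double w;
    do {
      w = rnd.nextDouble();
    } while (w <= 0.0 || w >= 1.0); // nextDouble() may return exactly 0
    return mu + sigma * 0.5513 * Math.log(w / (1.0 - w));
  }

  /** Empirical standard deviation of n samples; a rough check of the factor 0.5513. */
  static double sampleDeviation(Random rnd, double mu, double sigma, int n) {
    double sum = 0, sumSq = 0;
    for (int i = 0; i < n; i++) {
      double y = randomBs(rnd, mu, sigma);
      sum += y;
      sumSq += y * y;
    }
    double mean = sum / n;
    return Math.sqrt(sumSq / n - mean * mean);
  }

  public static void main(String[] args) {
    System.out.println(sampleDeviation(new Random(1L), 5.0, 2.0, 200000));
  }
}
```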



28.10 List of Functions 
28.10.1 Gamma Function 

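Definition 28.71 (Gamma Function). The Gamma function 88 $\Gamma : \mathbb{C} \mapsto \mathbb{C}$ is the extension of the factorial (see Definition 28.8 on page 467) to the real and complex numbers. For complex numbers $z \in \mathbb{C}$ with a positive real part $\mathrm{Re}(z) > 0$ it is defined as:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$$ (28.314)

Furthermore, the following equations hold for the gamma function.

$$\Gamma(z + 1) = z\,\Gamma(z)$$ (28.315)

$$\Gamma(1) = 1$$ (28.316)

$$\Gamma(z) = (z - 1)! \quad \forall z \in \mathbb{N}$$ (28.317)

$$\Gamma(z) = \lim_{n \to \infty} \frac{n!\,n^z}{z (z + 1) \cdots (z + n)}$$ (28.318)

$$\Gamma(z) = \frac{e^{-\gamma z}}{z} \prod_{n=1}^{\infty} \left(1 + \frac{z}{n}\right)^{-1} e^{z/n}$$ (28.319)

$\gamma$ in Equation 28.319 denotes the Euler-Mascheroni constant 89.

$$\gamma = \lim_{n \to \infty} \left[\left(\sum_{k=1}^{n} \frac{1}{k}\right) - \ln n\right]$$ (28.320)

$$\gamma \approx 0.57721566490153286060651209008240243\ldots$$ (28.321)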
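The limit that defines the Euler-Mascheroni constant converges slowly but can be explored numerically. The following Java sketch uses the hypothetical class `EulerMascheroniSketch` and simply evaluates the bracketed term for a finite $n$; the error of this approximation shrinks roughly like $\frac{1}{2n}$.

```java
/** Sketch: approximates the Euler-Mascheroni constant by a finite harmonic sum (hypothetical helper class). */
class EulerMascheroniSketch {

  /** Returns (1 + 1/2 + ... + 1/n) - ln(n). */
  static double approximate(int n) {
    double harmonic = 0.0;
    for (int k = n; k >= 1; k--) {
      harmonic += 1.0 / k; // add the small terms first to limit rounding errors
    }
    return harmonic - Math.log(n);
  }

  public static void main(String[] args) {
    System.out.println(approximate(1_000_000));
  }
}
```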



28.10.2 Riemann Zeta Function 

Definition 28.72 (Riemann Zeta Function). The Riemann zeta function 90 $\zeta(s)$ [1733] is the function of the complex variable $s$ defined as

$$\zeta(s) = \sum_{i=1}^{+\infty} \frac{1}{i^s} = \prod_{\forall \text{ primes } p} \frac{1}{1 - p^{-s}}$$ (28.322)

Some values of the zeta function are listed in Table 28.22.

$s$      $\zeta(s)$
0        $\zeta(0) = -\frac{1}{2}$
1/2      $\zeta(\frac{1}{2}) \approx -1.460\,354\,508\,809\,586\,8$
1        $\zeta(1) = \infty$
2        $\zeta(2) \approx 1.645$
5/2      $\zeta(\frac{5}{2}) \approx 1.341$
3        $\zeta(3) \approx 1.202$
7/2      $\zeta(\frac{7}{2}) \approx 1.127$
6        $\zeta(6) \approx 1.0173$

Table 28.22: Some values of the Riemann zeta function.

88 http://en.wikipedia.org/wiki/Gamma_function [accessed 2007-09-30]

89 http://en.wikipedia.org/wiki/Euler-Mascheroni_constant [accessed 2007-09-30]

90 http://en.wikipedia.org/wiki/Riemann_zeta_function [accessed 2008-08-24]
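For real arguments $s > 1$, both the sum and the product over the primes in Equation 28.322 can be approximated by truncation. The Java sketch below does this with the hypothetical class `ZetaSketch`; the truncation limits are arbitrary choices, and the code is not meant for $s \leq 1$, where the series does not converge.

```java
/** Sketch: truncated sum and truncated prime product for zeta(s), real s > 1 (hypothetical helper class). */
class ZetaSketch {

  /** Partial sum of 1 / i^s for i = 1..terms. */
  static double bySum(double s, int terms) {
    double sum = 0.0;
    for (int i = terms; i >= 1; i--) {
      sum += Math.pow(i, -s); // small terms first
    }
    return sum;
  }

  /** Partial product of 1 / (1 - p^(-s)) over all primes p <= limit, found with a simple sieve. */
  static double byPrimeProduct(double s, int limit) {
    boolean[] composite = new boolean[limit + 1];
    double product = 1.0;
    for (int p = 2; p <= limit; p++) {
      if (composite[p]) continue;
      product /= 1.0 - Math.pow(p, -s);
      for (long q = (long) p * p; q <= limit; q += p) {
        composite[(int) q] = true;
      }
    }
    return product;
  }

  public static void main(String[] args) {
    System.out.println(bySum(2.0, 1_000_000));
    System.out.println(byPrimeProduct(2.0, 100_000));
  }
}
```

Both values should lie close to the entry $\zeta(2) \approx 1.645$ of Table 28.22.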
Clustering



Clustering algorithms 1 divide a dataset into several disjoint subsets. All elements in such 
a subset share common features like, for example, spatial proximity. Clustering has many 
different applications like: 

1. Data Mining and Data Analysis [133, 610, 1430, 183],

2. Information Processing and Information Management [187, 1151, 2265, 1770], 

3. Pattern Recognition [203, 2156, 1172], 

4. Image Processing [1102, 1817], and 

5. Medicine [1868, 477, 2316]. 

Definition 29.1 (Clustering). Clustering is the unsupervised classification of patterns 
(observations, data items, or feature vectors) into groups (clusters) [1029]. With clustering, 
one dataset is partitioned into subsets (clusters), so that the data in each subset (ideally) 
share some common trait - often proximity according to some defined distance measure. 
Figure 29.1 illustrates a possible result C of the application of a clustering algorithm to a 
set A of elements with two features. 




[Figure: the elements of the set A are grouped into clusters $c \in C$, with $C = \mathrm{cluster}(A)$.]



Figure 29.1: A clustering algorithm applied to a two-dimensional dataset A. 



In the field of global optimization there is another application for clustering algorithms. For many problems the set of optimal solutions $X^\star$ is very large or even infinite. An optimization algorithm then cannot store or return it as a whole. Therefore, clustering techniques are often used in order to reduce the optimal set while not losing its characteristics - the diversity of the individuals included in the "current optimal set" is preserved, just their number is reduced. This is especially relevant in elitist evolutionary algorithms (see Definition 2.4 on page 103), which maintain an archive of the best individuals currently known.

Data clustering algorithms are either hierarchical or partitional. A hierarchical algorithm uses previously established clusters to successively find new clusters. The result of such an algorithm is a hierarchy of clusters. Partitional algorithms, on the other hand, determine all clusters at once. In the context of this book, we only need the division of a set into clusters - a hierarchy of this division is unnecessary.

1 http://en.wikipedia.org/wiki/Data_clustering [accessed 2007-07-03]

There also exist so-called fuzzy clustering 2 [1106, 1216] methods that do not create 
clear divisions but assign a vector of probabilities to each element. This vector contains a 
component for each cluster which denotes the probability of the element to belong to it. 
Again, in the context of this book, we only regard clustering algorithms that group each 
data element to exactly one single cluster. Therefore, we define a clustering algorithm as 
follows: 

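Definition 29.2 (Clustering Algorithm). A clustering algorithm $C = \mathrm{cluster}(A)$ constructs a set $C$ consisting of elements which are disjoint subsets of a set $A$ and, if united, cover $A$ completely (see also Figure 29.1).

$$\begin{aligned} C = \mathrm{cluster}(A) \Rightarrow\ & \forall c \in C, \forall a \in c \Rightarrow a \in A\ \wedge \\ & \forall c_1 \neq c_2 \wedge c_1, c_2 \in C \Rightarrow c_1 \cap c_2 = \emptyset\ \wedge \\ & \forall a \in A\ \exists c \in C : a \in c \end{aligned}$$ (29.1)

deduced: $$\bigcup_{\forall c \in C} c = A$$ (29.2)

deduced: $$C \subseteq \mathcal{P}(A)$$ (29.3)

For the last deduced formula see the definition of the power set $\mathcal{P}$, Definition 27.9 on page 458.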

There is, however, one important fact that must not be left unsaid here: Although we
define clustering algorithms in terms of sets for simplicity, they are actually applied to
lists. A set can contain the same element only once, hence {1, 2, 1} = {1, 2}. A clustering
algorithm, however, may receive an input A that contains multiple equal elements. This is
our little dirty backdoor here: we consider $A = \{a_1, a_2, \ldots, a_n\}$ as the input set and allow its
elements to have equal values, such as $a_1 = 1$, $a_2 = 2$, and $a_3 = 1$. When performing the
clustering, we only consider the symbols $a_1 \ldots a_n$. This allows us to use straightforward and
elegant set-based definitions as done in Definition 29.2 without loss of generality.

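Definition 29.3 (Partitions in Clustering). We define the set $\mathfrak{C}$ of all possible partitions
of A into clusters. Furthermore, the subset $\mathfrak{C}_k \subseteq \mathfrak{C}$ is the set of all partitions of A into
k clusters. The number of possible configurations in $\mathfrak{C}_k$ for any given k equals the Stirling
number of the second kind S(|A|, k) [221].

$$
\begin{aligned}
C \in \mathfrak{C} \Leftrightarrow{} & \forall c \in C,\ \forall a \in c \Rightarrow a \in A \ \wedge \\
& \forall c_1 \neq c_2 \wedge c_1, c_2 \in C \Rightarrow c_1 \cap c_2 = \emptyset \ \wedge \\
& \forall a \in A\ \exists c \in C : a \in c
\end{aligned} \tag{29.4}
$$

$$ C \in \mathfrak{C}_k \Leftrightarrow C \in \mathfrak{C} \wedge |C| = k \tag{29.5} $$

$$ |\mathfrak{C}_k| = S(|A|, k) = \frac{1}{k!} \sum_{i=1}^{k} (-1)^{k-i} \binom{k}{i}\, i^{|A|} \tag{29.6} $$

$$ |\mathfrak{C}| = \sum_{k=1}^{n} |\mathfrak{C}_k| = \sum_{k=1}^{n} S(|A|, k) \quad \text{with } n = |A| \tag{29.7} $$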
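To get a feeling for how quickly the number of possible partitions grows, the two counting formulas can be evaluated directly. The following is only a minimal sketch: the function names `stirling2` and `count_partitions` are chosen here for illustration and do not come from the book.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: ways to split n distinct
    elements into exactly k non-empty clusters (explicit sum formula)."""
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1))
    return total // factorial(k)  # the sum is always divisible by k!


def count_partitions(n: int) -> int:
    """Number of all partitions of an n-element set into 1..n clusters."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (4, 5, 10):
        print(n, count_partitions(n))
```

Already for ten elements there are 115975 different partitions, which illustrates why enumerating all configurations is not an option for realistic set sizes.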

On the elements a of the set A which are subject to clustering, we impose a simple
restriction: Although we allow any sort of elements a in A, we assume that exactly one
vector $\alpha(a) \in \mathbb{R}^n$ is assigned to each such element a. In other words, there exists a
function $\alpha : A \mapsto \mathbb{R}^n$ which relates the features of each element a of A to a vector of real
numbers. This allows us to apply distance metrics and similar measures to the elements.



2 http://en.wikipedia.org/wiki/Fuzzy_clustering [accessed 2007-07-03] 
In the context of global optimization, the elements a would, for example, be the solution
candidates, like evolved programs in the population A, and the function $\alpha(a)$ would then
correspond to the values of their objective functions $f \in F$.

From now on, we will be able to treat the elements a like vectors of real numbers (if needed)
without loss of generality. Note that even though we assume that there exists a binary
relation which assigns a real vector to each element of A, this is not necessarily the case for
the opposite direction. Picking up the previous example, it is unlikely that there is
one program for each possible combination of objective values.

Definition 29.4 (Centroid). The centroid 3 [4] of a cluster is its center or, put plainly,
the arithmetic mean of all its points.

$$ \mathrm{centroid}(c) = \frac{1}{|c|} \sum_{\forall a \in c} a \tag{29.8} $$

29.1 Distance Measures 

Each clustering algorithm needs some form of distance measuring, be it between two ele- 
ments or between two clusters. Therefore we define the prototype of a distance measurement 
function as follows: 

Definition 29.5 (Distance Measure). A distance measurement function dist rates the
distance between two elements of the same type (set) as a positive real number which is the
larger, the farther apart the two elements are.

$$ \mathrm{dist}(a, b) \in \mathbb{R}^+, \quad a, b \in A \tag{29.9} $$

29.1.1 Distance Measures for Strings of Equal Length 

Definition 29.6 (Hamming Distance). For two tuples a and b of the same length, the
Hamming [882] distance 4 $\mathrm{dist}_{Ham}(a, b)$ is defined as the number of locations in which a and
b differ.

$$ \mathrm{dist}_{Ham}(a, b) = \left| \{ i : a[i] \neq b[i],\ 0 \leq i < |a| \} \right| \quad \forall a, b : \mathrm{len}(a) = \mathrm{len}(b) \tag{29.10} $$

The Hamming distance is used in many error-correction schemes, since it also equals
the number of single-symbol substitutions required to change one string into the other. The
Hamming distance of 100101 and 101001 is 2, whereas the Hamming distance of Hello World
and Hello Earth is 4.
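The definition translates almost literally into code. The sketch below uses a hypothetical function name `hamming_distance`, accepts any two sequences of equal length, and rejects inputs of different length, since the distance is not defined for them.

```python
def hamming_distance(a, b) -> int:
    """Number of positions at which the equal-length sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


if __name__ == "__main__":
    print(hamming_distance("100101", "101001"))
    print(hamming_distance("Hello World", "Hello Earth"))
```

For the two bit strings the function returns 2. For the two greetings it returns 4, because only the positions W/E, o/a, l/t, and d/h differ while the r is shared.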

29.1.2 Distance Measures for Real-Valued Vectors

As already mentioned at the beginning of Chapter 29, we assume that there is a real-valued
vector in $\mathbb{R}^n$ assigned to each element $a \in A$ by an implicit function $\alpha : A \mapsto \mathbb{R}^n$. Therefore,
the distance measures introduced here can be used for all A subject to clustering.

Definition 29.7 (Manhattan Distance). The Manhattan distance 5 $\mathrm{dist}_{Man}(a, b)$ denotes
the sum of the absolute distances of the coordinates of the two vectors a and b.

$$ \mathrm{dist}_{Man}(a, b) = \sum_{i=1}^{n} \left| a[i] - b[i] \right| \quad \forall a, b \in \mathbb{R}^n \tag{29.11} $$

3 http://en.wikipedia.org/wiki/Centroid [accessed 2007-07-03] 

4 http://en.wikipedia.org/wiki/Hamming_distance [accessed 2007-07-03] 

5 http://en.wikipedia.org/wiki/Manhattan_distance [accessed 2007-07-03] 
Thus, the Manhattan distance of $(1, 2, 3)^T$ and $(3, 2, 1)^T$ is 4.

Definition 29.8 (Euclidean Distance). The Euclidean distance 6 $\mathrm{dist}_{eucl}(a, b)$ is the
"ordinary" distance of two points (denoted by the two vectors a and b) in Euclidean space.
This value is obtained by applying the Pythagorean theorem 7.

$$ \mathrm{dist}_{eucl}(a, b) = \sqrt{\sum_{i=1}^{n} \left( a[i] - b[i] \right)^2} \quad \forall a, b \in \mathbb{R}^n \tag{29.12} $$

Therefore, the Euclidean distance of $(1, 2, 3)^T$ and $(3, 2, 1)^T$ is $\sqrt{8}$.

Definition 29.9 (Norm). A vector norm 8, denoted by $\|a\|$, is a function which assigns a
positive length or size to all vectors a in a vector space (or set) $A \subseteq \mathbb{R}^n$, other than the zero
vector.

Some common norms of the element $a \in A \subseteq \mathbb{R}^n$ are:

1. The Manhattan norm 9:

$$ \|a\|_1 = \sum_{i=1}^{n} |a[i]| \tag{29.13} $$

2. The Euclidean norm:

$$ \|a\|_2 = \sqrt{\sum_{i=1}^{n} \left( a[i] \right)^2} \tag{29.14} $$

3. The p-norm is a generalization of the two examples above:

$$ \|a\|_p = \left( \sum_{i=1}^{n} |a[i]|^p \right)^{1/p} \tag{29.15} $$

4. The infinity norm 10 is the special case of the p-norm for $p \to \infty$:

$$ \|a\|_\infty = \max\{ |a[1]|, |a[2]|, \ldots, |a[n]| \} \tag{29.16} $$

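Such norms can be used as distance measures, and we hence define a new distance
measurement function as:

$$ \mathrm{dist}_{n,p}(a, b) = \|a - b\|_p \quad \forall a, b \in A \subseteq \mathbb{R}^n \tag{29.17} $$

$$ \mathrm{dist}_{Man} \equiv \mathrm{dist}_{n,1} \tag{29.18} $$

$$ \mathrm{dist}_{eucl} \equiv \mathrm{dist}_{n,2} \tag{29.19} $$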
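The norm-based distance family can be sketched in a few lines. The function name `dist_norm` and the convention of passing `math.inf` for the infinity norm are assumptions made for this illustration only.

```python
import math


def dist_norm(a, b, p=2.0):
    """p-norm of the difference a - b; p=1 is the Manhattan distance,
    p=2 the Euclidean distance, p=math.inf the maximum (infinity) norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


if __name__ == "__main__":
    a, b = (1, 2, 3), (3, 2, 1)
    print(dist_norm(a, b, p=1))         # Manhattan distance
    print(dist_norm(a, b, p=2))         # Euclidean distance
    print(dist_norm(a, b, p=math.inf))  # largest coordinate difference
```

For the vectors (1, 2, 3) and (3, 2, 1) used above, the three calls give 4, the square root of 8, and 2, respectively.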



If the places of the vectors a have different ranges, for example $a[1] \in [0, 1]$ and
$a[2] \in [0, 100000]$, a norm of the difference of two such vectors may not represent their
true "semantic" distance. Here, the contribution of the first elements of two vectors a and b
to their distance will most likely be negligible. However, the two vectors $(0, 0)^T$ and $(1, 100)^T$
may be considered "more different" than $(0, 0)^T$ and $(0.1, 110)^T$, since they differ by the whole
range in their first elements. Therefore, an additional distance measure, the $\mathrm{dist}_{nn,p}$ distance,
is defined which normalizes the vector places before finally computing the norm.

$$
\mathrm{dist}_{nn,p}(a, b) = \left( \sum_{i=1}^{n} \left|
\begin{cases}
\dfrac{a[i] - b[i]}{\mathrm{range}(A)[i]} & \text{if } \mathrm{range}(A)[i] > 0 \\[1ex]
a[i] - b[i] & \text{otherwise}
\end{cases}
\right|^p \right)^{1/p} \tag{29.20}
$$
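The effect of the normalization can be checked with the two vector pairs from the example. This is a sketch under one explicit assumption: range(A)[i] is taken to be the difference between the largest and the smallest value observed at place i over all vectors in A. The names `value_ranges` and `dist_normalized` are hypothetical.

```python
def value_ranges(vectors):
    """Assumed meaning of range(A)[i]: max - min of coordinate i over all vectors."""
    return [max(col) - min(col) for col in zip(*vectors)]


def dist_normalized(a, b, ranges, p=2.0):
    """p-norm of the coordinate differences, each divided by its range
    whenever that range is positive (raw difference otherwise)."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        d = (x - y) / r if r > 0 else (x - y)
        total += abs(d) ** p
    return total ** (1.0 / p)


if __name__ == "__main__":
    A = [(0, 0), (1, 100), (0.1, 110), (0.5, 100000)]
    ranges = value_ranges(A)
    print(dist_normalized(A[0], A[1], ranges))  # differs by the full first range
    print(dist_normalized(A[0], A[2], ranges))  # much smaller after normalizing
```

With the ranges 1 and 100000, the pair (0, 0) and (1, 100) ends up roughly ten times farther apart than the pair (0, 0) and (0.1, 110), whereas the plain Euclidean distance would rank them the other way round.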



6 http://en.wikipedia.org/wiki/Euclidean_distance [accessed 2007-07-03]

7 http://en.wikipedia.org/wiki/Pythagorean_theorem [accessed 2007-07-03] 

8 http://en.wikipedia.org/wiki/Vector_norm [accessed 2007-07-03] 

9 http://en.wikipedia.org/wiki/Taxicab_geometry [accessed 2007-07-03] 
10 http://en.wikipedia.org/wiki/Maximum_norm [accessed 2007-07-03]

29.1.3 Elements Representing a Cluster

We already stated that there is not necessarily an $a \in A$ assigned to each real vector in $\mathbb{R}^n$.
Thus, there also does not necessarily exist an element a at the center centroid(c) of a cluster c.
For our purposes in this book, we are, however, interested in elements representing clusters.
Since I have not found an established name for them in the literature, we will call such
elements nuclei. We can define different functions nucleus($c \in C$) to compute such nuclei
which, in turn, depend on a distance measure. We will again abbreviate this distance function
by dist. It is an implicit parameter which can be replaced by any of the functions introduced
before. Again, the default setting is $\mathrm{dist} = \mathrm{dist}_{eucl} \equiv \mathrm{dist}_{n,2}$.

The first possible nucleus method, $\mathrm{nucleus}_c$, would be to take the element which is closest
to the centroid centroid(c) of the cluster c:

$$ n = \mathrm{nucleus}(c) \Leftrightarrow n \in c \ \wedge\ \mathrm{dist}(a, \mathrm{centroid}(c)) \geq \mathrm{dist}(n, \mathrm{centroid}(c)) \quad \forall a \in c \tag{29.21} $$
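The difference between a centroid, which may not be an element of the cluster, and a nucleus, which always is one, can be illustrated as follows. The sketch assumes that the elements are already given as tuples of real numbers and that the Euclidean distance (`math.dist`, available from Python 3.8) is the default. The function names simply mirror the text.

```python
import math


def centroid(cluster):
    """Arithmetic mean of all points of the cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def nucleus(cluster, dist=math.dist):
    """Element of the cluster that lies closest to the cluster's centroid."""
    center = centroid(cluster)
    return min(cluster, key=lambda a: dist(a, center))


if __name__ == "__main__":
    c = [(0, 0), (4, 0), (0, 4), (1, 1)]
    print(centroid(c))  # (1.25, 1.25) is not an element of c
    print(nucleus(c))   # (1, 1) is the element nearest to the centroid
```

If several elements are equally close to the centroid, this sketch returns the first of them. The text leaves that choice open.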



29.1.4 Distance Measures Between Clusters 

In order to determine the distance between two clusters, another set of distance measures can
be applied. Such distance measures usually compute the distance between two clusters
as a function of the distances between their elements which is, in turn, defined using a
secondary distance function. We will abbreviate this secondary distance function by dist,
which can be replaced by any of the functions named in the above subsections. We assume
it to be an implicit parameter with the default value $\mathrm{dist} = \mathrm{dist}_{eucl} \equiv \mathrm{dist}_{n,2}$. Let $c_1$ and $c_2$
be two clusters in C, then we can define the following distance measures between them:

1. The maximum distance between the elements of the two clusters (also called complete
linkage):

$$ \mathrm{dist}_{max}(c_1, c_2) = \max\{ \mathrm{dist}(a, b),\ \forall a \in c_1, b \in c_2 \}; \quad \forall c_1, c_2 \in C \tag{29.22} $$

2. The minimum distance between the elements of the two clusters (also called single
linkage):

$$ \mathrm{dist}_{min}(c_1, c_2) = \min\{ \mathrm{dist}(a, b),\ \forall a \in c_1, b \in c_2 \}; \quad \forall c_1, c_2 \in C \tag{29.23} $$

3. The mean distance between the elements of the two clusters (also called average linkage):

$$ \mathrm{dist}_{avg}(c_1, c_2) = \frac{1}{|c_1| \cdot |c_2|} \sum_{\forall a \in c_1} \sum_{\forall b \in c_2} \mathrm{dist}(a, b); \quad \forall c_1, c_2 \in C \tag{29.24} $$

4. The increase in variance $\mathrm{dist}_{var}(c_1, c_2)$ if the clusters were merged.

5. The distance of their centers:

$$ \mathrm{dist}_{cent}(c_1, c_2) = \mathrm{dist}(\mathrm{centroid}(c_1), \mathrm{centroid}(c_2)); \quad \forall c_1, c_2 \in C \tag{29.25} $$

6. The distance of their nuclei computed by the nucleus function nucleus (see the definition
of nucleus in Section 29.1.3):

$$ \mathrm{dist}_{nucl}(c_1, c_2) = \mathrm{dist}(\mathrm{nucleus}(c_1), \mathrm{nucleus}(c_2)); \quad \forall c_1, c_2 \in C \tag{29.26} $$
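Four of the measures listed above can be written down directly. The following sketch assumes clusters given as lists of real-valued tuples and uses the Euclidean distance `math.dist` as the secondary distance. The variance-based and nucleus-based measures are left out to keep the example short.

```python
import math
from itertools import product


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def dist_max(c1, c2, dist=math.dist):
    """Complete linkage: largest distance between any two elements."""
    return max(dist(a, b) for a, b in product(c1, c2))


def dist_min(c1, c2, dist=math.dist):
    """Single linkage: smallest distance between any two elements."""
    return min(dist(a, b) for a, b in product(c1, c2))


def dist_avg(c1, c2, dist=math.dist):
    """Average linkage: mean over all element pairs."""
    return sum(dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))


def dist_cent(c1, c2, dist=math.dist):
    """Distance between the two centroids."""
    return dist(centroid(c1), centroid(c2))


if __name__ == "__main__":
    c1 = [(0, 0), (0, 2)]
    c2 = [(3, 0), (3, 2)]
    for f in (dist_min, dist_avg, dist_max, dist_cent):
        print(f.__name__, f(c1, c2))
```

For any pair of clusters, the single linkage value is never larger than the average linkage value, which in turn is never larger than the complete linkage value.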
29.2 Clustering Algorithms



29.2.1 Cluster Error 

The most commonly used partitional clustering strategies are based on the square error
criterion. The general aim is to obtain a partition which minimizes the square error for a
given number k [1028], which we generalize to fit any given distance measure dist:

Definition 29.10 (Clustering Error). We define the error $\mathrm{error}_c$ inside a cluster as the
sum of the distances of its elements from its center, based on a distance measure function.
The total error of a partition, $\mathrm{error}_p$, is then the sum of the errors of all clusters included.
Normally, we will use $\mathrm{dist}_{eucl} \equiv \mathrm{dist}_{n,2}$ as distance measure.

$$ \mathrm{error}_c(c) = \sum_{\forall a \in c} \mathrm{dist}(a, \mathrm{centroid}(c)) \tag{29.27} $$

$$ \mathrm{error}_p(C) = \sum_{\forall c \in C} \mathrm{error}_c(c) = \sum_{\forall c \in C} \sum_{\forall a \in c} \mathrm{dist}(a, \mathrm{centroid}(c)) \tag{29.28} $$

Normally, this error is minimized under the premise of a fixed number of clusters k = |C|.
Then, an optimum configuration C* is searched within the set $\mathfrak{C}_k$ of all partitions of A
into k clusters. This optimum C* is defined by $\mathrm{error}_p(C^*) = \min\{ \mathrm{error}_p(C)\ \forall C \in \mathfrak{C}_k \}$.
Since testing all possible configurations C is too expensive (see Equation 29.7), finding the
optimum C* is an optimization task itself. Doing so inside an optimization process
is hence only rarely applicable. Here we will introduce some algorithms which approximate
good C.
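The two error measures are cheap to evaluate for a given partition, which is what the clustering algorithms below exploit. A minimal sketch, assuming clusters are lists of real-valued tuples and the Euclidean distance is used:

```python
import math


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def error_c(cluster, dist=math.dist):
    """Sum of the distances of all elements of one cluster to its centroid."""
    center = centroid(cluster)
    return sum(dist(a, center) for a in cluster)


def error_p(partition, dist=math.dist):
    """Total error of a partition: sum of the errors of its clusters."""
    return sum(error_c(c, dist) for c in partition)


if __name__ == "__main__":
    good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
    bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
    print(error_p(good), error_p(bad))
```

Both partitions split the same four points into two clusters, but grouping the neighboring points together yields an error of 4 instead of 20.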

29.2.2 k-means Clustering

k-means clustering 11 [1343, 210, 2111] partitions the data points $a \in A$ into k disjoint
subsets $c \subseteq A$, $c \in C \in \mathfrak{C}_k$. It tries to minimize the sum of all distances between the data
points and the centers of the clusters they belong to. In general, the algorithm does not achieve
a global minimum over the assignments. Despite this limitation, k-means clustering is used
frequently because it is easy to implement [2191].
k-means clustering works approximately as follows [1028]:

1. Select an initial partition of k clusters.

2. Create a new partition by assigning each $a \in A$ to the cluster with the closest center.
Repeat this until the partition does not change anymore.

3. Modify the cluster set by merging, dividing, deleting, or creating clusters. If the clustering
error of the new partition is smaller than the error of the previous one, then go back to
step 2.

In order to perform the modification of the cluster set, we introduce a function called
kMeansModify obeying the following conditions.

$$
\begin{aligned}
C_{new} = \mathrm{kMeansModify}_k(C) \Rightarrow{} & \forall a \in c_1 \in C\ \exists c_2 \in C_{new} : a \in c_2 \ \wedge \\
& \forall a \in c_2 \in C_{new}\ \exists c_1 \in C : a \in c_1
\end{aligned} \tag{29.29}
$$

In other words, kMeansModify translates one set of clusters C to another one, $C_{new}$,
which still fulfills Definition 29.2 on page 536. One (crude) example for an implementation of
kMeansModify is specified as Algorithm 29.1.

We demonstrate how k-means clustering works in Algorithm 29.2. As distance measure
dist (lines 23 and 25), usually the Euclidean distance between the centroids of the clusters $c_1$
and $c_2$, $\mathrm{dist}_{cent}(c_1, c_2)$, see page 539, is used.
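Steps 1 and 2 of the list above can be sketched compactly. Note that this is the plain textbook iteration (often called Lloyd's algorithm) and not the variant with the kMeansModify step described in this section: it picks k random elements as initial centers, which is an assumption of this sketch, and then alternates between assigning elements and recomputing centroids. The name `k_means` and the parameters `seed` and `max_iter` are hypothetical.

```python
import math
import random


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def k_means(points, k, seed=0, max_iter=100):
    """Assign points to the closest center and recompute the centers
    until the assignment does not change anymore."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k random elements
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # partition is stable
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ran empty
                centers[j] = centroid(members)
    clusters = [[p for p, a in zip(points, assignment) if a == j] for j in range(k)]
    return [c for c in clusters if c]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, 2))
```

On well-separated data such as the two groups above, the iteration settles on the obvious split within a few rounds. On less clear-cut data the result depends on the initial centers, which is the local-optimum behavior mentioned in the text.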



11 http://en.wikipedia.org/wiki/K-nearest-neighbor_estimator [accessed 2007-07-03]

Algorithm 29.1: C_new ← kMeansModify_k(C)

Input: [implicit] k: the number of clusters wanted, k < |A|
Input: [implicit] dist: the distance measure between clusters to be used
Input: [implicit] dist2: the distance measure between elements to be used
Input: C: the list of clusters c to be modified
Data: m: the index of the cluster C[m] with the lowest error
Data: n: the index of the cluster C[n] nearest to C[m]
Data: s: the index of the cluster C[s] with the highest error
Output: C_new: the modified list of clusters

begin
  m ← m : error_c(C[m]) = min{error_c(C[i]) ∀i ∈ [0, k − 1]}
  n ← n : dist(C[m], C[n]) = min{dist(C[m], C[i]) ∀i ∈ [0, k − 1] \ {m}}
  s ← s : error_c(C[s]) = max{error_c(C[i]) ∀i ∈ [0, k − 1] \ {m, n}}
  C[m] ← C[m] ∪ C[n]
  a ← a ∈ C[s] : dist2(a, centroid(C[s])) ≥ dist2(b, centroid(C[s])) ∀b ∈ C[s]
  C[n] ← {a}
  C[s] ← C[s] \ {a}
  return C
end



29.2.3 nth Nearest Neighbor Clustering

The nth nearest neighbor clustering algorithm is defined in the context of this book only. It
creates at most k clusters, where the first k − 1 clusters contain exactly one element each. The
remaining elements are all included together in the last cluster. The elements of the
single-element clusters are those which have the longest distance to their nth nearest neighbor.
This clustering algorithm is suitable for reducing a large set to a smaller one which still
contains the most interesting elements (those in the single-element clusters). It has relatively low
complexity and thus runs fast, but on the other hand has the drawback that dense aggregations
of > n elements will be put into the "rest elements" cluster. For n, normally a value of
$n = \sqrt{k}$ is used.

nth nearest neighbor clustering uses the kth nearest neighbor distance function $\mathrm{dist}^{\rho}_{nn,k}$
introduced in Definition 28.63 on page 506 with its parameter k set to n. Do not mix this
parameter up with the parameter k of this clustering method: although they have the same
name, they are not the same. I know, I know, this is not pretty.

Notice that Algorithm 29.3 should only be applied if all the elements $a \in A$ are unique,
i.e., there exist no two equal elements in A (which is, per definition, true for all sets). In a real
implementation, a preprocessing step should remove all duplicates from A before clustering
is performed. Especially our home-made nearest neighbor clustering variant is unsuitable
for processing lists containing the same elements multiple times. Since all equal elements have
the same distance to their nth neighbor, it is likely that the result of the clustering is very
unsatisfying, since one element may occur multiple times whereas a variety of different other
elements is ignored. Therefore, the aforementioned preprocessing should be applied, which
may have the drawback that we could possibly obtain a set C with fewer than k clusters. In
the Sigoa system's implementation of the nth nearest neighbor clustering, only one instance
of each group of equal elements in A is permitted to become a single-node cluster per run,
and multiple runs are performed until k clusters have been created.
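The idea can be sketched as follows. Since Definition 28.63 is not reproduced here, the sketch assumes that the nearest neighbor distance of an element is simply the n-th smallest Euclidean distance to the other elements. Duplicates are removed up front, as recommended above, so the result may contain fewer than k clusters. The function names are hypothetical and the sketch is not the Sigoa implementation.

```python
import math


def nth_nn_distance(a, points, n, dist=math.dist):
    """Assumed definition: distance from a to its n-th nearest other point."""
    others = sorted(dist(a, b) for b in points if b != a)
    return others[n - 1]


def nth_nearest_neighbor_cluster(points, k, n, dist=math.dist):
    unique = list(dict.fromkeys(points))  # drop duplicates, keep the order
    ranked = sorted(unique, key=lambda a: nth_nn_distance(a, unique, n, dist),
                    reverse=True)         # most isolated elements first
    singles = ranked[:max(0, min(k, len(ranked)) - 1)]
    rest = [a for a in unique if a not in singles]
    clusters = [[a] for a in singles]     # k - 1 single-element clusters
    if rest:
        clusters.append(rest)             # all remaining elements together
    return clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (0, 10), (0, 0)]
    print(nth_nearest_neighbor_cluster(pts, k=3, n=1))
```

In the example, the two outliers (10, 0) and (0, 10) become single-element clusters, while the dense group around the origin ends up in the "rest" cluster.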

29.2.4 Linkage Clustering 



The linkage method [1466, 2329] is used to create a set C containing at most k clusters.
This algorithm initially creates one cluster for each single element in the set A. This set C of
clusters c is then reduced by iteratively merging the two closest clusters. Again, the distance
Algorithm 29.2: C ← kMeansCluster_k(A)

Input: A: the set of elements a to be clustered
Input: [implicit] k: the number of clusters wanted, 0 < k < |A|
Input: [implicit] dist: the distance measure between clusters to be used
Input: [implicit] dist2: the distance measure between elements to be used
Input: [implicit] kMeansModify: a function that modifies the cluster set
Data: C: the tuple of clusters c computed, |C| = k
Data: A_cpy: a temporary copy of A used for initialization
Data: C_old: the cluster set of the previous inner iteration
Data: C_new: the cluster set of the current inner iteration
Data: i: a counter variable for the loops
Data: d: the distance between the cluster {a} and the current cluster in C_old
Data: d_min: the minimum distance between {a} and any cluster in C_old
Data: i_min: the index of the cluster with the minimum distance in C_old
Output: C: the set of clusters, i.e., all the items of the tuple C represented as a set

 1 begin
 2   A_cpy ← A
 3   k ← min{k, |A|}
 4   C_new ← createList(k, ∅)
 5   i ← len(C_new) − 1
 6   while i > 0 do
 7     C_new[i] ← {a ∈ A_cpy}
 8     A_cpy ← A_cpy \ C_new[i]
 9     i ← i − 1
10   C_new[0] ← A_cpy
11   repeat
12     C ← C_new
13     C_new ← kMeansModify_k(C_new)
14     repeat
15       C_old ← C_new
16       i ← len(C_new) − 1
17       while i ≥ 0 do
18         C_new[i] ← ∅
19         i ← i − 1
20       foreach a ∈ A do
21         i ← len(C_old) − 1
22         i_min ← 0
23         d_min ← dist({a}, C_old[0])
24         while i > 0 do
25           d ← dist({a}, C_old[i])
26           if d < d_min then
27             d_min ← d
28             i_min ← i
29           i ← i − 1
30         C_new[i_min] ← C_new[i_min] ∪ {a}
31     until C_old = C_new
32   until error_p(C) ≤ error_p(C_new)
33   return listToSet(C)
34 end
Algorithm 29.3: C ← nNearestNeighborCluster_k^(n)(A)

Input: A: the set of elements a to be clustered
Input: [implicit] k: the number of clusters wanted (0 < k < |A|)
Input: [implicit] n: index for the nearest neighbors
Input: [implicit] dist: the distance measure to be used
Data: L: the list of elements, sorted in descending order of their nth nearest neighbor distance
Data: i: the counter variable
Output: C: the set of clusters c computed, |C| = k

begin
  L ← sortList_d(setToList(A), dist^ρ_{nn,n})
  i ← min{k, |L|} − 2
  C ← ∅
  while i ≥ 0 do
    C ← C ∪ {{L[i]}}
    A ← A \ {L[i]}
    i ← i − 1
  return C ∪ {A}
end



measure function dist (see line 11 of Algorithm 29.4) used can be any of the distance measures
already introduced.

According to the cluster distance measure dist chosen, linkageCluster realizes different
types of linkage clustering algorithms 12 (see Section 29.1.4 on page 539):

1. If $\mathrm{dist}(c_1, c_2) = \mathrm{dist}_{max}(c_1, c_2)$ denotes the maximum distance of the elements in two
clusters, complete linkage clustering is performed.

2. If $\mathrm{dist}(c_1, c_2) = \mathrm{dist}_{avg}(c_1, c_2)$ denotes the mean distance of the elements in two clusters,
average linkage clustering is performed.

3. If $\mathrm{dist}(c_1, c_2) = \mathrm{dist}_{min}(c_1, c_2)$ denotes the minimum distance of the elements in two
clusters, single linkage clustering is performed.
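The agglomerative scheme and its dependence on the cluster distance can be sketched in a few lines. The example assumes real-valued tuples, the Euclidean element distance `math.dist`, and passes the cluster distance as a parameter `cluster_dist` (a hypothetical name), so that the same loop performs single or complete linkage.

```python
import math
from itertools import combinations, product


def dist_min(c1, c2):
    """Single linkage distance between two clusters."""
    return min(math.dist(a, b) for a, b in product(c1, c2))


def dist_max(c1, c2):
    """Complete linkage distance between two clusters."""
    return max(math.dist(a, b) for a, b in product(c1, c2))


def linkage_cluster(points, k, cluster_dist=dist_min):
    """Start with one cluster per element and merge the two closest
    clusters until at most k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > max(k, 1):
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    print(linkage_cluster(pts, 3))
    print(linkage_cluster(pts, 3, cluster_dist=dist_max))
```

Each pass over all cluster pairs costs quadratic time in the number of clusters, so this direct formulation is only practical for small sets.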



29.2.5 Leader Clustering 

The leader clustering algorithm is a very simple one-pass method to create clusters. Basically,
we begin with an empty leader list and an empty set of clusters. Step by step, the elements
a are extracted from the set A subject to clustering. Each a is then compared to the elements in
the leader list in order to find a leader l with dist(a, l) smaller than a specified maximum
distance D. If such a leader exists, a is added to its cluster; otherwise, a becomes the leader
of a new cluster containing only itself. Leader clustering can either be performed by
using the first leader l found with dist(a, l) < D and assigning a to its cluster ([255],
Algorithm 29.5) or by comparing a to all possible leaders and thus finding the leader l closest
to a, i.e., $\mathrm{dist}(a, l) \leq \mathrm{dist}(a, l_2)\ \forall l_2 \in \mathrm{leaders}$ ([84], Algorithm 29.6).
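Both variants fit into one short routine. The sketch below is an illustration only: the flag `best_leader` (a hypothetical name) switches between taking the first leader within the maximum distance and checking only the closest leader. Elements are assumed to be real-valued tuples compared with `math.dist`.

```python
import math


def leader_cluster(points, max_dist, best_leader=False, dist=math.dist):
    """One-pass leader clustering; returns the list of clusters."""
    leaders, clusters = [], []
    for a in points:
        if best_leader:
            # only the closest existing leader is a candidate
            candidates = sorted(range(len(leaders)),
                                key=lambda j: dist(leaders[j], a))[:1]
        else:
            # every leader is a candidate, checked in order of creation
            candidates = range(len(leaders))
        for j in candidates:
            if dist(leaders[j], a) < max_dist:
                clusters[j].append(a)
                break
        else:
            # no suitable leader found: a becomes the leader of a new cluster
            leaders.append(a)
            clusters.append([a])
    return clusters


if __name__ == "__main__":
    pts = [(0,), (4,), (3,)]
    print(leader_cluster(pts, 3.5))
    print(leader_cluster(pts, 3.5, best_leader=True))
```

The element (3,) shows the difference: the first-leader variant puts it into the cluster of (0,), because that leader is checked first and is close enough, whereas the closest-leader variant assigns it to the cluster of (4,). In both variants the result depends on the order in which the elements are processed.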



12 http://en.wikipedia.org/wiki/Data_clustering#Agglomerative_hierarchical_clustering

Algorithm 29.4: C ← linkageCluster_k(A)

Input: A: the set of elements a to be clustered
Input: [implicit] k: the number of clusters wanted (0 < k < |A|)
Input: [implicit] dist: the distance measure to be used
Input: [implicit] dist2: the distance measure between elements a to be used by dist
Data: c_1: the first cluster to investigate
Data: c_2: the second cluster to investigate
Data: d: the distance between the clusters c_1 and c_2 currently investigated
Data: d_min: the minimum distance between two clusters c_r1, c_r2 found in the current
iteration
Data: c_r1: the first cluster of the nearest cluster pair
Data: c_r2: the second cluster of the nearest cluster pair
Output: C: the set of clusters c computed, |C| = k

 1 begin
 2   C ← ∅
 3   foreach a ∈ A do C ← C ∪ {{a}}
 4   while |C| > k do
 5     d_min ← ∞
 6     c_r1 ← ∅
 7     c_r2 ← ∅
 8     foreach c_1 ∈ C do
 9       foreach c_2 ∈ C do
10         if c_1 ≠ c_2 then
11           d ← dist(c_1, c_2)
12           if d < d_min then
13             d_min ← d
14             c_r1 ← c_1
15             c_r2 ← c_2
16     C ← C \ {c_r1}
17     C ← C \ {c_r2}
18     C ← C ∪ {c_r1 ∪ c_r2}
19   return C
20 end
Algorithm 29.5: C ← leaderCluster(A)

Input: A: the set of elements a to be clustered
Input: [implicit] D: the maximum distance between an element and a cluster's leader
Input: [implicit] dist: the distance measure to be used
Data: a: an element in A
Data: i: a counter variable
Data: L: the list of cluster leaders
Output: C: the list of clusters c computed

begin
  L ← ()
  C ← ()
  foreach a ∈ A do
    i ← len(L) − 1
    while i ≥ 0 do
      if dist(L[i], a) < D then
        C[i] ← C[i] ∪ {a}
        i ← −2
      i ← i − 1
    if i ≥ −1 then
      L ← addListItem(L, a)
      C ← addListItem(C, {a})
  return listToSet(C)
end



Algorithm 29.6: C ← leaderCluster(A)

Input: A: the set of elements a to be clustered
Input: [implicit] D: the maximum distance between an element and a cluster's leader
Input: [implicit] dist: the distance measure to be used
Data: a: an element in A
Data: i: a counter variable
Data: j: the index of the leader closest to a
Data: L: the list of cluster leaders
Output: C: the list of clusters c computed

begin
  L ← ()
  C ← ()
  foreach a ∈ A do
    i ← len(L) − 1
    j ← 0
    while i > 0 do
      if dist(L[i], a) < dist(L[j], a) then j ← i
      i ← i − 1
    if len(L) > 0 ∧ dist(L[j], a) < D then
      C[j] ← C[j] ∪ {a}
    else
      L ← addListItem(L, a)
      C ← addListItem(C, {a})
  return listToSet(C)
end
Theoretical Computer Science



30.1 Introduction 

Theoretical computer science 1 is the branch of computer science 2 that deals with the rather
mathematical, logical, and abstract aspects of computing. It subsumes areas like algorithmic
theory, complexity, the structure of programming languages, and the solvability of problems.

30.1.1 Algorithms and Programs 

In this and the following sections, we want to gain insight into the topic of algorithms, both
in local and distributed systems. There are two reasons for this. First, any global optimization
technique which we discuss in this book is an algorithm, often even a rather complicated
one. Sometimes we even want to use several computers to solve an optimization problem
cooperatively. Thus, we should know about the properties and theory of algorithms as well
as of distributed systems.

The second reason is that many example applications discussed in this book concern
the automated synthesis of distributed algorithms. To understand these, knowledge of the
features of distributed algorithms is valuable.

What are Algorithms? 

The term algorithm comprises essentially all forms of "directives what to do to reach a
certain goal". A culinary recipe is an algorithm, for example, since it tells how much of
what is to be added to a meal in which sequence and how everything should be heated.
The commands inside an algorithm can be very precise or very imprecise, depending on
the area of application. How accurately can we, for instance, carry out the instruction "Add
a tablespoon of sugar."? Algorithms are such a wide field that there exist numerous
different, rather fuzzy definitions for the word algorithm [19, 86, 90, 446, 1213]:

Definition 30.1 (algorithm). According to Whatis.com 3, an algorithm is a procedure or
formula for solving a problem. The word derives from the name of the mathematician, 
Mohammed ibn-Musa al-Khwarizmi, who was part of the royal court in Baghdad and who 
lived from about 780 to 850. Al-Khwarizmi's work is the likely source for the word algebra 
as well. 

Definition 30.2 (algorithm). Wikipedia 4 says that in mathematics, computing, linguistics,
and related disciplines, an algorithm is a procedure (a finite set of well-defined instructions)
for accomplishing some task which, given an initial state, will terminate in a defined
end-state. The computational complexity and efficient implementation of the algorithm are
important in computing, and this depends on suitable data structures.

1 http://en.wikipedia.org/wiki/Theoretical_computer_science [accessed 2007-07-03]

2 http://en.wikipedia.org/wiki/Computer_science [accessed 2007-07-03]

3 http://searchvb.techtarget.com/sDefinition/0,,sid8_gci211545,00.html [accessed 2007-07-03]

4 http://en.wikipedia.org/wiki/Algorithm [accessed 2007-07-03]

Definition 30.3 (algorithm). An algorithm is a computable set of steps to achieve a 
desired result according to the National Institute of Standards and Technology 5 . 

Definition 30.4 (algorithm). Wolfram MathWorld 6 defines algorithm as a specific set of
instructions for carrying out a procedure or solving a problem, usually with the require- 
ment that the procedure terminate at some point. Specific algorithms sometimes also go 
by the name method, procedure, or technique. The word "algorithm" is a distortion of 
al-Khwarizmi, a Persian mathematician who wrote an influential treatise about algebraic 
methods. The process of applying an algorithm to an input to obtain an output is called a 
computation. 



Algorithm:

  Xpop = createPop(n)
  Input: n: the size of the population to be created
  Data: i: a counter variable
  Output: Xpop: the new, random population
  begin
    Xpop ← ()
    i ← n
    while i > 0 do
      Xpop ← appendList(Xpop, create())
      i ← i − 1
    return Xpop
  end

Program (Schematic Java, High-Level Language):

  List<IIndividual> createpop(int n) {
    List<IIndividual> xpop;
    xpop = new ArrayList<IIndividual>(n);
    for (int i = n; i > 0; i--) {
      xpop.add(create());
    }
    return xpop;
  }

Program (Schematic Assembly Language):

  PROC createpop
    push eax
    push eax
    call ArrayList::create
    pop ecx
  @loop:
    push ecx
    push eax
    push eax
    call create
    push eax
    call ArrayList::add
    pop eax
    pop ecx
    loop @loop
    RET EAX
  END PROC

Program (Schematic Machine Language):

  50 01 50 01 9A 10 38 33 83
  8F 03 50 03 50 01 50 01 9A
  10 38 32 00 8F 01 8F 03 E2
  11 11 00 00 C3 00

Figure 30.1: The relation between algorithms and programs.



While an algorithm is a set of directions in a representation which is usually understandable
for human beings, programs are intended to be processed by machines and are therefore
expressed in a more machine-friendly form. Originally, this was machine code. For more than
sixty years [312], however, huge effort has been spent in order to allow us to write programs
in a more and more comprehensible syntax. A program is basically an algorithm realized
for a given computer or execution environment, as illustrated in Figure 30.1. The difference
between programs and algorithms hence today lies primarily in the degree of independence
from a given platform and in the intention.

5 http://www.nist.gov/dads/HTML/algorithm.html [accessed 2007-07-03] 

6 http://mathworld.wolfram.com/Algorithm.html [accessed 2007-07-03] 
Definition 30.5 (Program). A program 7 is a set of instructions that describe a task or
an algorithm to be carried out on a computer. Therefore, the primitive instructions of the
algorithm must be expressed either in a form that the machine can process directly (machine
code 8 [1622]), in a form that can be translated (1:1) into such code (assembly language [1793],
Java byte code [837], etc.), or in a high-level programming language 9 [1350] which can be
translated (n:m) into the latter using special software (compiler) [1162].

In Genetic Programming, programs are grown and not algorithms since the evolved 
structures are always bound to one specific simulation environment. The results may be 
transformed to algorithms by removing this binding. This process can become very com- 
plicated, especially for assembly language or machine code-like programs and there is no 
automated way for doing this to the knowledge of the author. 

Definition 30.6 ((Software) Process). In terms of software, a process 10 is a program 
that is currently executed. While a program only is a description of what to do, a process 
is the procedure of actually doing it. In a program for example the number and types of 
variables are described - in a process they are allocated and used. 

Here we should also mention one of the most fundamental principles of electronic data
processing 11, the IPO Model 12. As sketched in Figure 30.2, it consists of three parts:



1. The input (IPO) is external information or a stimulus that enters the system.

2. The processing step (IPO) is the set of all actions taken upon/using the input. In terms 
of software, these actions are performed by a process which is the running instance of a 
program. 

3. The output (IPO) comprises the results of the computation (processing phase) that leave 
the system. 

30.1.2 Properties of Algorithms 

Besides these definitions, algorithms all share the following properties. Well, with few ex- 
ceptions that we also will elaborate on. 

Definition 30.7 (Abstraction). An algorithm describes the process of solving a problem 
on a certain level of abstraction which is determined by the elementary algorithms and 
elementary objects it uses and the applied formalism. One of the most important methods 
of abstraction is the definition and reuse of sub-algorithms. 

Definition 30.8 (Discrete). A discrete algorithm works step-wise, i.e., it is built from
atomic executable instructions.

7 http://en.wikipedia.org/wiki/Computer_program [accessed 2007-07-03] 

8 http://en.wikipedia.org/wiki/Machine_code [accessed 2007-07-04] 

9 http://en.wikipedia.org/wiki/High-level_programming_language [accessed 2007-07-03] 
10 http://en.wikipedia.org/wiki/Process_%28computing%29 [accessed 2007-07-03]

11 http://en.wikipedia.org/wiki/Electronic_data_processing [accessed 2007-07-03]
12 http://en.wikipedia.org/wiki/IPO_Model [accessed 2007-07-03] 




Figure 30.2: A process in the IPO model. 
Definition 30.9 (Finite). The definition of a (static) finite algorithm has a limited length.
The sequence of instructions of static finite algorithms is thus finite. During its execution, a 
(dynamic) finite algorithm uses only a limited amount of memory to store its interim results. 

Definition 30.10 (Termination). Each execution of an algorithm terminates after a finite 
number of steps and returns its results. 

Definition 30.11 (Determinism). In each execution step of a deterministic algorithm, 
there exists at most one way to proceed. If no way to proceed exists, the algorithm has 
terminated. 

Deterministic algorithms do not contain instructions that use random numbers in order 
to decide what to do or how to modify data. Most of the optimization techniques included 
in this book are randomized algorithms. They hence are not deterministic. We give an 
introduction into this matter in Definition 30.18 on page 552. 

Definition 30.12 (Determined). An algorithm is determined if it always yields the same 
results (outputs) for the same inputs. 

30.1.3 Complexity of Algorithms 

For most problems, there exists more than one approach that will lead to a correct solution. 
In order to find out which one is the "best", we need some sort of metrics which we can 
compare [2166, 2223]. 

The most important measures obtained by analyzing an algorithm 13 are the time that 
it takes to produce the wanted outcome and the storage space needed for internal data 
[446]. We call those the time complexity and the space complexity dimensions. The time- 
complexity denotes how many steps algorithms need until they return their results. The 
space complexity determines how much memory an algorithm consumes at most in one run. 
Of course, these measures depend on the input values passed to the algorithm. If we have an 
algorithm that should decide whether a given number is prime or not, the number of steps 
needed to find that out will differ if the inputs are 1 or $2^{32582657} - 1$. Therefore, for both 
dimensions, the best-case, average-case, and the worst-case complexity exist. 
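
The dependence of the step count on the input can be made concrete with a small sketch. It uses plain trial division, which is only an assumed stand-in for "an algorithm that decides whether a number is prime"; the function name `is_prime_with_steps` and the choice to count one step per trial division are illustrative assumptions, not part of the original text.

```python
def is_prime_with_steps(n):
    """Trial division. Returns (is_prime, steps), where steps counts
    the number of trial divisions that were performed."""
    steps = 0
    if n < 2:
        return False, steps
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            # A divisor was found: n is composite.
            return False, steps
        d += 1
    # No divisor up to sqrt(n): n is prime.
    return True, steps


if __name__ == "__main__":
    # Best case: an even number is rejected after a single division.
    print(is_prime_with_steps(10**12))
    # Worst case for this method: a prime forces all divisions up to sqrt(n).
    print(is_prime_with_steps(2**31 - 1))
```

For an even input the loop stops after one division, whereas a prime input of similar magnitude needs about the square root of n divisions, which is the gap between best case and worst case that the text refers to.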

In order to compare the time and space requirements of algorithms, some approximative 
notations have been introduced [1159, 1894, 1162]. As we have just seen, the time and space 
requirements of an algorithm normally depend on the size of its inputs. We can describe this 
dependency as a function of this size. In real systems, however, knowledge of the exact 
dependency is not needed. If we, for example, know that sorting n data elements with the 
Quicksort algorithm 14 [933, 1163] takes on average roughly $n \log_2 n$ steps, this is 
sufficient, even if the correct number is $2n \ln n \approx 1.39\, n \log_2 n$. 
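
The constant 1.39 in the Quicksort estimate comes from changing the base of the logarithm. As a short check, using only the identity $\ln n = \ln 2 \cdot \log_2 n$:

```latex
2n \ln n \;=\; 2n \cdot \ln 2 \cdot \log_2 n \;=\; (2 \ln 2)\, n \log_2 n \;\approx\; 1.386\, n \log_2 n
```

The two expressions therefore differ only by a constant factor, which is exactly the kind of difference that the asymptotic notations ignore.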

The Big-O-family notations introduced by Bachmann [96] and made popular by Landau 
[1236] allow us to group functions together that rise at approximately the same speed. 

Definition 30.13 (Big-O notation). The big-O 15 notation is a mathematical notation 
used to describe the asymptotical upper bound of functions. 

$$f(x) \in O(g(x)) \Leftrightarrow \exists\, x_0, m \in \mathbb{R} : m > 0 \wedge |f(x)| \le m\,|g(x)| \;\forall x > x_0 \tag{30.1}$$

In other words, a function $f(x)$ is in $O$ of another function $g(x)$ if and only if there exists 
a real number $x_0$ and a constant, positive factor $m$ so that the absolute value of $f(x)$ is 
smaller (or equal) than $m$-times the absolute value of $g(x)$ for all $x$ that are greater than $x_0$. 

13 http://en.wikipedia.org/wiki/Analysis_of_algorithms [accessed 2007-07-03] 

14 http://en.wikipedia.org/wiki/Quicksort [accessed 2007-07-03] 

15 http://en.wikipedia.org/wiki/Big_O_notation [accessed 2007-07-03] 

Therefore, $x^3 + x^2 + x + 1 = f(x) \in O(x^3)$ since for $m = 5$ and $x_0 = 2$ it holds that 
$5x^3 > x^3 + x^2 + x + 1 \;\forall x > 2$. 
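
The inequality can be spot-checked numerically. The sketch below is only a finite sanity check over integers, not a proof; the helper name `bound_holds` and the tested range are assumptions made for illustration.

```python
def f(x):
    return x**3 + x**2 + x + 1


def g(x):
    return x**3


def bound_holds(m, x0, upto):
    """Check |f(x)| <= m * |g(x)| for every integer x with x0 < x <= upto."""
    return all(abs(f(x)) <= m * abs(g(x)) for x in range(x0 + 1, upto + 1))


if __name__ == "__main__":
    print(bound_holds(5, 2, 10_000))  # the witness pair from the text
    print(bound_holds(1, 2, 10_000))  # m = 1 is too small, since f(x) > g(x)
```

A proof is just as short: for $x \ge 1$ each of the terms $x^2$, $x$, and $1$ is at most $x^3$, so $f(x) \le 4x^3$, and any factor $m \ge 4$ works.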

In terms of algorithmic complexity, we specify the number of steps or the amount of memory 
an algorithm needs as a function of the size of its inputs in big-O notation. A discussion of 
this topic and some examples can be found in Table 30.1. 

Algorithms with exponential complexity perform slowly and quickly become infeasible with 
increasing input size. For many hard problems, only algorithms of this class are known. Their 
solutions can otherwise only be approximated by means of randomized global optimization 
techniques. 



Table 30.1: Some examples of the big-O notation 



Definition 30.14 (Big-Ω notation). The big-Ω notation is a mathematical notation used 
to describe the asymptotical lower bound of functions. 

$$f(x) \in \Omega(g(x)) \Leftrightarrow \exists\, x_0, m \in \mathbb{R} : m > 0 \wedge |f(x)| \ge m\,|g(x)| \;\forall x > x_0 \tag{30.2}$$

$$f(x) \in \Omega(g(x)) \Leftrightarrow g(x) \in O(f(x)) \tag{30.3}$$

Definition 30.15 (Θ notation). The Θ notation is a mathematical notation used to 
describe both an upper and a lower asymptotical bound of functions. 

$$f(x) \in \Theta(g(x)) \Leftrightarrow f(x) \in O(g(x)) \wedge f(x) \in \Omega(g(x)) \tag{30.4}$$



Definition 30.16 (Small-o notation). The small-o notation is a mathematical notation 
used to define that a function is asymptotically negligible compared to another one. 

$$f(x) \in o(g(x)) \Leftrightarrow \lim_{x \to \infty} \frac{f(x)}{g(x)} = 0 \tag{30.5}$$

Definition 30.17 (Small-ω notation). The small-ω notation is a mathematical notation 
used to define that another function is asymptotically negligible compared to a special function. 

$$f(x) \in \omega(g(x)) \Leftrightarrow \lim_{x \to \infty} \frac{f(x)}{g(x)} = \infty \tag{30.6}$$

$$f(x) \in \omega(g(x)) \Leftrightarrow g(x) \in o(f(x)) \tag{30.7}$$

30.1.4 Randomized Algorithms 

Deterministic algorithms 16 will always produce the same results when given the same inputs. 
Such behavior comes closest to the original intention behind the definition of algorithms. 
The execution of a recipe should always yield the same meal, sorting identical lists should 
always result in, again identical, sorted lists. So in general, algorithms are considered to be 
deterministic. For many problems, however, deterministic algorithms are infeasible. In global 
optimization (see Section 1.1.1 on page 22), the problem space X is often extremely large 
and the relation of an element's structure and its utility as solution is not obvious. Hence, 
the search space G often cannot be partitioned wisely and an exhaustive search would be the 
only deterministic option left. Such an approach would take an infeasibly long time. Here, 
the only way out is using a randomized algorithm. 

Definition 30.18 (Randomized Algorithm). A randomized algorithm 17 includes at 
least one instruction that acts on the basis of random numbers. In other words, a randomized 
algorithm violates the constraint of determinism. Randomized algorithms are also often 
called probabilistic algorithms [1473, 965, 1438, 964, 1474]. 

There are two general classes of randomized algorithms: Las Vegas and Monte Carlo 
algorithms. 

Definition 30.19 (Las Vegas Algorithm). A Las Vegas algorithm 18 is a randomized 
algorithm that never returns a wrong result [86, 1473, 965, 964]. 

Either it returns the correct result, reports a failure, or does not return at all. If a Las 
Vegas algorithm returns, its outcome is deterministic (but not the algorithm itself). The 
termination (see Definition 30.10 on page 550) however cannot be guaranteed. There usually 
exists an expected runtime limit for such algorithms - their actual execution however may 
take arbitrarily long. In summary, we can say that a Las Vegas algorithm terminates with 
a positive probability and is (partially) correct. 

Definition 30.20 (Monte Carlo Algorithm). A Monte Carlo algorithm 19 always 
terminates. Its result, however, can be correct or incorrect [1473, 965, 964]. In contrast to Las 
Vegas algorithms, Monte Carlo algorithms always terminate but are (partially) correct 
only with a positive probability. 
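
The difference between the two classes can be illustrated with a toy search problem. This is a minimal sketch under assumed names (`las_vegas_find`, `monte_carlo_contains`) and an assumed task, random probing of a list; it is not taken from the cited literature.

```python
import random


def las_vegas_find(items, target, rng):
    """Las Vegas style: probe random positions until the target is found.
    The returned index is never wrong, but the number of probes is random,
    and the loop does not terminate if the target is absent."""
    while True:
        i = rng.randrange(len(items))
        if items[i] == target:
            return i


def monte_carlo_contains(items, target, rng, max_probes):
    """Monte Carlo style: probe at most max_probes random positions.
    Always terminates, but may answer False although the target is
    present (a one-sided error)."""
    for _ in range(max_probes):
        i = rng.randrange(len(items))
        if items[i] == target:
            return True
    return False


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make runs repeatable
    data = ["a", "b"] * 50
    idx = las_vegas_find(data, "a", rng)
    print(data[idx])  # always "a"
    print(monte_carlo_contains(data, "a", rng, max_probes=3))  # may be wrong
```

With half of the entries equal to the target, the Monte Carlo variant errs with probability $2^{-k}$ for $k$ probes, so its error can be made small at the price of a longer, but still bounded, runtime.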

Definition 30.21 (Monte Carlo Method). Monte Carlo methods 20 are a class of Monte 
Carlo algorithms used for simulating the behavior of systems of different types. Therefore, 
Monte Carlo methods are nondeterministic and often incorporate random numbers [845, 
1339, 1744, 1294]. 
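
A standard textbook illustration of a Monte Carlo method is the estimation of π by random sampling. The sketch below is such an illustration; the function name `estimate_pi` and the fixed default seed are assumptions made so that the run is repeatable.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of uniformly random points in the
    unit square that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))
```

The run always terminates after the requested number of samples, but the returned value is only approximately correct, and its accuracy improves slowly (roughly with the square root of the sample count).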



16 http://en.wikipedia.org/wiki/Deterministic_computation [accessed 2007-07-03], 
also Definition 30.11 on page 550 

17 http://en.wikipedia.org/wiki/Randomized_algorithm [accessed 2007-07-03] 

18 http://en.wikipedia.org/wiki/Las_Vegas_algorithm [accessed 2007-07-03] 

19 http://en.wikipedia.org/wiki/Monte_carlo_algorithm [accessed 2007-07-03] 

20 http://en.wikipedia.org/wiki/Monte_Carlo_method [accessed 2007-07-03] 

30.2 Distributed Systems and Distributed Algorithms 

Various definitions have been issued for the terms distributed system and distributed algo- 
rithms by several researchers such as Bal [122], Lamport [1235], Tanenbaum and van Steen 
[2006], Mattern [1370], Tel [2010], Barbosa [146], Coulouris et al. [457], Ghosh [799], and 
Mühl [1475]. These definitions most often only differ in minor details and can be summarized 
as follows. 

Definition 30.22 (Distributed System). A distributed system is a set of autonomous 
systems (nodes) which are connected by a network and communicate via the exchange of 
messages. [1475] 

Definition 30.23 (Distributed Algorithm). Distributed algorithms [1370, 2010, 146] 
are algorithms which are executed by multiple computers in a distributed system and coop- 
eratively try to solve a given problem. 

Distributed algorithms can be distinguished from sequential algorithms because they 
run on multiple nodes in parallel in order to cooperatively solve one problem. They can 
be distinguished from mere parallel algorithms since each node in the distributed system 
executes instances of the same algorithm with a (usually) different view on the global state 
[2010, 122, 2006]. 

The reason for this lack of a common view on the global state is that each node has only 
knowledge about the information locally available on it. Information on the other nodes can 
only be obtained via communication which usually comprises the exchange of messages. 

Latency is the time difference between the moment when something is initiated and 
the moment when its effects become observable [457]. Communication usually involves 
latency. Whenever a process sends a message, its contents are handed down to the operating 
system or a middleware. From this moment on, the process considers the message as sent. 
The operating system now must initialize the communication, prepare the message for the 
transmission medium, and send it to its destination(s). 

The laws of physics induce an additional delay, preventing the message from arriving 
at its target instantaneously. This delay is normally negligible. Yet it is observable in 
satellite communication, for example when the host of a news show talks with a reporter on 
the other side of the globe. 

Once a message arrives at the destination node, it is reassembled from the medium and 
the operating system or middleware passes its contents to the receiving process. From the 
moment at which the execution of this process is resumed, the message is considered as 
received. 

Of course, with technical effort such as special clocking, latency could be made transpar- 
ent for the system. In general computer networks (let alone MANETs or sensor networks) 
this is not possible and messages are always delayed. Because of this latency, the nodes can- 
not have exactly the same view on the world and it is not possible to have an exact, globally 
synchronized system time available. Furthermore, networks may induce arbitrary errors into 
the message's content and messages can even get lost, i. e., have an infinite latency. 

Distributed algorithms can provide the following advantages (depending on their design): 
modularity, flexibility, resource-sharing, no central point of failure, scalability because of 
decentralization, robustness, high availability, and fault-tolerance. In turn, they may have the 
following drawbacks, again depending on their design: higher complexity, no common view 
on the global state, no global time, processes may fail, latency and faults in communication, 
problems in termination detection, deadlocks, and race conditions. 

Whether a distributed algorithm is adequate or not depends on the degree to which it 
exploits the advantages of the distribution and how strongly the mentioned drawbacks are 
present in its design. The quality of an adequate distributed algorithm can be determined 
by its functionality, its communications complexity, i. e., how many messages need to be 
exchanged in order to solve its task, or its time complexity, i. e., how many computational 
steps need to be performed on the single nodes. 


Definition 30.24 (Scalability). Scalability 21 is a measure describing how well a system 
can grow or be extended to process a higher computational load. 

Definition 30.25 (Central Point Of Failure). A central (or single) point of failure is a 
subsystem or process that, if it fails, leads to the collapse of the whole distributed system. 
Central servers are a typical example of central points of failure. 

Definition 30.26 (Bottleneck). The bottleneck 22 of a distributed application is the part 
that has the most limiting influence on its performance. 

Imagine, for instance, an hourglass. Here, the constriction at its center is the bottleneck that 
limits the amount of sand that can fall down per time unit. 




30.2.1 Network Topologies 

Definition 30.27 (Network Topology). Network topology 23 is the study of the arrangement 
of the components of a network such as connections and nodes. The network layout itself 
can also be referred to as topology. 

In the further text, we will use the term edge synonymously for link and connection and 
the term vertex as synonym for node or computer since network topology is closely related 
to graph theory. 

Each computer network has exactly one physical topology which is the layout of its 
physical components (computers, cables). This physical structure defines which nodes can 
communicate directly with each other and which not. On top of that physical design, several 
virtual/overlay topologies may be built. 

Definition 30.28 (Overlay Network). An overlay network 24 is a virtual network which 
is built on top of another computer network. The nodes in the overlay network are connected 
by virtual or logical links [50]. 

IP addresses 25 , for instance, form an overlay topology on top of MAC addresses 26 in Ether- 
nets 27 . A peer-to-peer network is an overlay network because it runs on top of the internet. 
Several distributed algorithms require the nodes to be arranged in special topologies like 
stars or rings. This can be achieved in arbitrary networks by defining an overlay structure 
which performs the corresponding routing and address translations. 

When speaking of topology, one would normally think of a hardwired network of 
computers connected with each other through Ethernet cabling and the like. If we 
consider a WLAN 28 or a wireless sensor network as described in Definition 30.32 on page 559 

21 http://en.wikipedia.org/wiki/Scalability [accessed 2008-02-08] 

22 http://en.wikipedia.org/wiki/Bottleneck [accessed 2007-07-03] 

23 http://en.wikipedia.org/wiki/Network_topology [accessed 2007-07-03] 

24 http://en.wikipedia.org/wiki/Overlay_network [accessed 2007-07-03] 

25 http://en.wikipedia.org/wiki/IP_address [accessed 2008-02-09] 

26 http://en.wikipedia.org/wiki/Mac_address [accessed 2008-02-09] 

27 http://en.wikipedia.org/wiki/Ethernet [accessed 2008-02-09] 

28 http://en.wikipedia.org/wiki/Wireless_LAN [accessed 2008-02-08] 

on the other hand, there is of course no such thing as cabling. But still, there is a certain 
topology: not all nodes may be able to directly contact each other since their radio 
transmission ranges are limited. They may only be able to exchange messages directly with some 
nodes in their physical neighborhood. Hence, we can span a graph over this network, 
where each node is connected only to its neighbors in communication range. This graph 
then defines the topology. 
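
Such a range-induced graph can be sketched in a few lines. The model below assumes that all nodes have the same radio range and that links are symmetric, which is a simplification; the function name `range_topology` and the coordinates are hypothetical.

```python
import math


def range_topology(positions, radio_range):
    """Return adjacency sets: nodes i and j are neighbors if and only if
    their Euclidean distance is at most radio_range."""
    n = len(positions)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= radio_range:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


if __name__ == "__main__":
    # Three nodes in a row plus one node far away from the others.
    nodes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    print(range_topology(nodes, radio_range=1.5))
```

In this example node 0 reaches node 2 only via node 1, and the distant node has no neighbor at all, so the network is partitioned.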




Figure 30.3: Some simple network topologies: (a) unrestricted topology, (b) bus, (c) star, 
(d) ring, (e) hierarchy, (f) grid, (g) fully connected. 



Unrestricted 

In an unrestricted network topology like the one sketched in Fig. 30.3.a, we make only the 
general assumption that there is no network partition. In other words, for all nodes n in the 
network, there exists at least one path to each other node in the network. This path may, 
of course, consist of multiple hops over multiple connections. 

Bus 

All nodes in a bus system (illustrated in Fig. 30.3.b) are connected to the same transmission 
medium in a linear arrangement. All messages sent over the medium can be considered to 
be broadcasts that potentially can be received by all nodes more or less simultaneously. The 
transmission medium has exactly two ends. 

Star 

Fig. 30.3.c shows an example of a star topology. Here, all nodes are connected to a single 
node in the center of the network. This center could, for example, be an Ethernet hub 29 or 
switch 30 that retransmits the messages received to their correct destination. It could as well 
be a server that performs some specific tasks for the nodes. For a detailed discussion of the 
client-server architecture see Section 30.2.2. 

Ring 

In this topology, each node is connected to exactly two other nodes in a way that no partition 
exists. The nodes are arranged in a sequence where the first and the last node are connected 
with each other. An instance of the ring topology is illustrated in Fig. 30.3.d. 

Hierarchy 

Fig. 30.3.e illustrates a hierarchical topology where the nodes of the network are arranged 
in the form of a tree. 

Grid 

The nodes in a grid are laid out in a two-dimensional lattice so that each node (except those 
at the borders of the grid) is connected with four neighbors: one to the left, one to the right, 
one above and one below. Fig. 30.3.f is an instance of such a topology. 

Fully Connected 

In a fully connected network, as sketched in Fig. 30.3.g, each node is directly connected with 
each other node. 
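
Since topologies are graphs, the layouts above can be written down as adjacency sets and compared, for instance by their number of edges. The following is a minimal sketch; all function names are assumptions, and the breadth-first search implements the "no network partition" condition of the unrestricted topology.

```python
from collections import deque


def ring(n):
    """Ring of n >= 3 nodes: node i is linked to its two cyclic neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def star(n):
    """Star of n nodes: node 0 is the center, nodes 1..n-1 are leaves."""
    adj = {i: {0} for i in range(1, n)}
    adj[0] = set(range(1, n))
    return adj


def fully_connected(n):
    return {i: set(range(n)) - {i} for i in range(n)}


def grid(rows, cols):
    """Two-dimensional lattice; inner nodes have four neighbors."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            adj[(r, c)] = {(a, b) for a, b in cand
                           if 0 <= a < rows and 0 <= b < cols}
    return adj


def is_connected(adj):
    """Breadth-first search: True iff every node is reachable from an
    arbitrary start node, i.e. there is no network partition."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


def edge_count(adj):
    """Each undirected edge appears in two adjacency sets."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


if __name__ == "__main__":
    for name, topo in [("ring", ring(6)), ("star", star(6)),
                       ("grid", grid(3, 3)), ("full", fully_connected(6))]:
        print(name, edge_count(topo), is_connected(topo))
```

The edge counts show the cost side of the comparison: a ring or star of n nodes needs on the order of n links, whereas a fully connected network needs n(n-1)/2.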

30.2.2 Some Architectures of Distributed Systems 

Client-Server Systems 

Definition 30.29 (Client-Server). Client-server 31 is a network architecture that 
separates two types of nodes: the client(s) and the server(s) [2321, 72]. A client 32 utilizes a service 
provided by a server 33 . It does so by sending a request to the server. This request contains 
details of the task to be carried out, for example the URL of a website to be returned. The 
server then executes appropriate actions and, in most cases, sends a response to the client. 
Usually, there is a small number of servers (normally one) which serve many clients. 
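
The request/response pattern can be sketched without any networking. In the toy model below a direct method call stands in for the message exchange; the classes `Server` and `Client`, the dictionary-based message format, and the status codes are illustrative assumptions rather than a real protocol implementation.

```python
class Server:
    """Holds the resources and answers requests from any number of clients."""

    def __init__(self, pages):
        self.pages = pages
        self.requests_served = 0

    def handle(self, request):
        # The request carries the details of the task, here a URL.
        self.requests_served += 1
        body = self.pages.get(request["url"])
        if body is None:
            return {"status": 404, "body": ""}
        return {"status": 200, "body": body}


class Client:
    """Uses the service by sending a request and reading the response."""

    def __init__(self, server):
        self.server = server

    def fetch(self, url):
        return self.server.handle({"url": url})


if __name__ == "__main__":
    server = Server({"/index.html": "<h1>Hello</h1>"})
    clients = [Client(server) for _ in range(3)]
    for c in clients:
        print(c.fetch("/index.html"))
    print(server.requests_served)
```

All clients depend on the one `Server` object, which already hints at the weaknesses of this architecture: every request passes through a single component.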

Client-server architectures like the one illustrated in Figure 30.4 are the most basic and 
the most common application logical architecture in distributed computing [457, 1370, 2006]. 
They are part of almost all internet applications like: 

29 http://en.wikipedia.org/wiki/Ethernet_hub [accessed 2007-07-03] 

30 http://en.wikipedia.org/wiki/Ethernet_switch [accessed 2007-07-03] 

31 http://en.wikipedia.org/wiki/Client_server [accessed 2007-07-03] 

32 http://en.wikipedia.org/wiki/Client_%28computing%29 [accessed 2007-07-03] 

33 http://en.wikipedia.org/wiki/Server_%28computing%29 [accessed 2007-07-03] 

Figure 30.4: Multiple clients connected with one server 



1. Websites 34 in the world wide web 35 are obtained by using the HTTP 36 protocol for 
communication between a web browser 37 and a web server 38 . 

2. Application servers 39 contain the business logic of corporations. They support online 
shops 40 with an underlying business model, for example. 

3. Database servers 41 provide computers in a network with access to large data sets. Fur- 
thermore, they allow their clients to send structured queries that allow aggregation and 
selection of specific data. 

4. ... 

The major advantage of client-server systems is their simplicity. Local algorithms can 
often be integrated into servers without too many problems while their adaptation to more 
complicated architectures is, well, more difficult and error-prone. The greatest weakness of 
the client-server scheme is that the server represents a bottleneck (see Definition 30.26) and 
a single point of failure (see Definition 30.25 on page 554). 

Peer-To-Peer Networks 

Definition 30.30 (Peer-To-Peer Network). Instead of being composed of client and 
server nodes, a peer-to-peer 42 network consists only of equal peer nodes. A peer node works 
as a server for its fellow peers by providing certain functionality and simultaneously acts as 
a client utilizing a similar service from its peers [457, 1370, 2006, 1959, 39]. Therefore, a 
peer node is often also called servent 43 , a combination of the words server and client. The 
expression peer-to-peer is often abbreviated as P2P. 

Peer-to-peer networks may have an arbitrary structure like the one sketched in 
Figure 30.5. While client-server systems are limited to providing communication between 
the clients and the server solely, peer-to-peer networks may resemble any sort of underlying 
communication graph. 

Peer-to-peer architectures circumvent the existence of single points of failure and can be 
constructed to be very robust against bottlenecks. They furthermore are often ad hoc, i. e., 



34 http://en.wikipedia.org/wiki/Website [accessed 2007-07-03] 

35 http://en.wikipedia.org/wiki/Www [accessed 2007-07-03] 

36 http://en.wikipedia.org/wiki/Http [accessed 2007-07-03] 

37 http://en.wikipedia.org/wiki/Web_browser [accessed 2007-07-03] 

38 http://en.wikipedia.org/wiki/Web_server [accessed 2007-07-03] 

39 http://en.wikipedia.org/wiki/Application_server [accessed 2007-07-03] 

40 http://en.wikipedia.org/wiki/Online_shop [accessed 2007-07-03] 

41 http://en.wikipedia.org/wiki/Database_server [accessed 2007-07-03] 

42 http://en.wikipedia.org/wiki/Peer-to-peer [accessed 2007-07-03] 

43 http://en.wikipedia.org/wiki/Servent [accessed 2007-07-03] 

Figure 30.5: A peer-to-peer system in an unstructured network 



new peers may join the network at any time and leave it whenever they decide to do so. 
This can also be regarded as a drawback since the structure of the network (and thus its 
computational power and connectivity) may fluctuate heavily, as may the availability of data 
provided by the peers. 

Strictly following the definition, there are no centralized components in a peer-to-peer 
network. However, there exist hybrid networks where the peers, for example, register at a 
dedicated server which keeps track of the users online. Also, there exist different hierarchical 
or non-hierarchical overlay networks. 

Important peer-to-peer-based applications are: 

1. File and content sharing systems [60, 1807] are the most influential and widespread 
P2P systems. Millions of users today share music, videos, documents and software over 
networks like Gnutella 44 [420, 1740], Bittorrent 45 , appleJuice 46 and the famous but 
shut-down Napster 47 network. 

2. Many scientific applications like Seti@home 48 , Einstein@home 49 , and Folding@home 50 
rely on users all over the world that voluntarily provide their unused computational 
resources. Such applications are most often constructed as screensavers that, after be- 
coming active, download some pieces of data from a server and perform computations 
on them. After finishing the work on the received data, a response is issued to the server. 

3. Many instant messaging 51 systems like talk 52 utilize peer-to-peer protocols. Most often, 
the clients need to log on and send status information to a server. Communication then 
either works client-server-based or in P2P-manner. Especially when audio or video chats 
come into play, peer-to-peer approaches are usually preferred. 



44 http://en.wikipedia.org/wiki/Gnutella [accessed 2007-07-03] 

45 http://en.wikipedia.org/wiki/BitTorrent [accessed 2007-07-03] 

46 http://www.applejuicenet.de/ [accessed 2007-07-03] 

47 http://en.wikipedia.org/wiki/Napster [accessed 2007-07-03] 

48 http://en.wikipedia.org/wiki/Seti_at_home [accessed 2007-07-03] 

49 http://en.wikipedia.org/wiki/Einstein%40Home [accessed 2007-07-03] 

50 http://en.wikipedia.org/wiki/Folding%40home [accessed 2007-07-03] 

51 http://en.wikipedia.org/wiki/Instant_messaging [accessed 2007-07-03] 

52 http://en.wikipedia.org/wiki/Talk_%28Unix%29 [accessed 2007-07-03] 

Sensor Networks 

Definition 30.31 (Sensor Network). A sensor network 53 [1012, 1966, 469, 1025] is a 
network of autonomous devices which are equipped with sensors and together measure physical 
quantities like temperature, sound, vibrations, pressure, motion, and the like. 

Definition 30.32 (Wireless Sensor Network). A wireless sensor network (WSN) [1697, 
326, 2317, 1092, 1873] is a sensor network where the single nodes are connected wirelessly, 
using techniques like wireless LAN 54 , Bluetooth 55 , or radio 56 . 




Figure 30.6: A block diagram outlining building blocks of a sensor node. 



Figure 30.6 sketches the building blocks of a sensor node. For communication with other 
nodes, short range radios, Bluetooth, or wireless LAN adapters are often added. The core 
of a sensor node is a microcontroller attached with RAM and ROM memory for code and 
data. The purpose of sensor networks is to measure some environmental parameters like 
temperature, humidity, or brightness. Thus, sensor nodes have one or multiple sensors attached. 

Since they are autonomous devices usually not connected to power lines, sensor nodes 
have to be equipped with some sort of energy source. Chemical batteries are used to store 
energy, but often power scavenging units [574, 1769, 1610, 1698] like, for example, solar cells 
[1881, 1050, 1537], thermal [1988] or kinetic energy harvesters [2149, 1609, 1229] are added. 
The field of energy supply of sensor nodes is critical and subject to active research [1843, 688, 
382, 767]. Batteries have limited capacity and are hard to replace after the network has been 
deployed. If no additional power scavenging unit is available, the sensor nodes will eventually 
stop functioning and become useless after all their energy is consumed. To extend this 
lifetime, energy-intensive operations like communication via radio transmissions need to be 
reduced as much as possible. 

The size of the sensor nodes ranges from shoe box down to matchbox dimensions. There 
is a strong wish to produce smaller and smaller nodes. Small sensors are recognized less 

53 http://en.wikipedia.org/wiki/Sensor_network [accessed 2007-07-03] 

54 http://en.wikipedia.org/wiki/Wireless_lan [accessed 2007-07-03] 

55 http://en.wikipedia.org/wiki/Bluetooth [accessed 2007-07-03] 

56 http://en.wikipedia.org/wiki/Radio [accessed 2007-07-03] 

obviously and blend better into their environment. Since they require less raw material, 
they might become much cheaper than their larger counterparts. On the other hand, with this 
movement toward really tiny sensors, some hard constraints arise. The 
size of the battery limits the amount of energy that can be stored, just as the area of 
a solar cell limits its energy production. The node size also restricts the dimensions of the 
memory and the sensors of the node [397]. 

Other important research topics are data fusion and transportation in a WSN [611, 846] 
as well as deployment and maintenance [2009, 957]. 

Widespread sensor node architectures are: 

1. BTNodes 57 are autonomous wireless communication and computing platforms based 
on a Bluetooth radio and a microcontroller. Developed at the ETH Zurich, BTNodes 
serve especially as demonstration, teaching, and research platforms. Fig. 30.7.a shows a 
BTNode. 

2. Crossbow's MICA2 58 motes are multipurpose nodes. These systems are applied widely 
in real-world applications like environmental control in agriculture and outdoor sports 
as well as for indoor sports and military purposes. A picture of the Mica2Dot platform 
can be found in Fig. 30.7.b. 

3. Scatterweb 59 provides both a research platform (MSB nodes, illustrated in Fig. 30.7.c) 
and an industrial sensor network (Scatter Nodes). 

4. Dust Networks 60 developed their SmartMesh for building wireless solutions for the global 
market. Their nodes use the Time Synchronized Mesh Protocol and middle-range radio 
to provide the reliability of a typical WLAN in their sensor networks. Fig. 30.7.d shows 
a Dust Networks Evaluation Mote. 

5. ... 

A small example application demonstrating the use of sensor networks is discussed 
in Section 24.1.2 on page 414. 

Properties of Peer-To-Peer Systems and Sensor Networks 

1. Current peer-to-peer networks are often large-scale, with tens of thousands [1807] up 
to millions [2262] of users/nodes online. Although networks of thousands of sensors are 
a future goal, the number of nodes in sensor networks has not yet reached this extent. 
However, systems of several hundreds of nodes have already been deployed [468, 1990]. 

2. Since wireless sensor networks have limited transmission range, it is possible that not all 
nodes in a network can communicate directly with each other. The same issue exists in 
the internet, but there it is solved in a transparent manner by routers. In sensor networks 
however, dedicated hardware routers normally do not exist. Therefore, special routing 
protocols [1978, 1387, 386] are applied. Here we see a strong relation between sensor 
networks and peer-to-peer systems: Each sensor may act as a sender of messages as well 
as router for other nodes. There is no generic hierarchy or division between senders or 
routers. 

3. Especially in peer-to-peer applications, there are strong fluctuations in the network mem- 
bership. In content sharing networks for example, new users continuously join and leave 
the network. In sensor networks on the other hand, volatility in the network structure 
arises from newly deployed nodes or nodes that become inactive because they ran out 
of battery power. A sensor node spends much of its time in sleep mode (so do I) and 
may be regarded as inactive in this time. When it triggers back to active mode, it again 

57 http://www.btnode.ethz.ch/ [accessed 2007-07-03] 

58 http://www.xbow.com/Products/productdetails.aspx?sid=156 [accessed 2007-07-03] 

59 http://www.inf.fu-berlin.de/inst/ag-tech/scatterweb_net/ [accessed 2007-07-03] and 
http://www.scatterweb.com/ [accessed 2007-07-03] 

60 http://www.dustnetworks.com/ [accessed 2007-07-03] 

Figure 30.7: Images of some sensor network platforms 



becomes a member of the network. Furthermore, networks of mobile sensors have large 
fluctuations in their topology by default. 

4. Since sensor networks utilize sleep cycles in order to reduce energy consumption, mes- 
sages that are routed may arbitrarily be delayed or even get lost. 

5. P2P networks often represent very heterogeneous environments, consisting of computers 
of different architectures and operating systems. Sensor networks on the other hand are 
most often homogeneous systems. 

30.3 Grammars and Languages 

Languages are the most important means of communication between higher animals 61 . 
Formal languages can also be used to define the formats for data being stored by or exchanged 
between computers and/or human beings. When analyzing a statement in a given language, 
we distinguish between its syntax and its semantics. 

Definition 30.33 (Syntax). The syntax 62 of a language is the set of rules that governs 
its structure. Each valid statement of a language must obey its syntactical structure. The 
sentence "I am reading a book." is a sequence of a subject, a predicate, and an object. 

Definition 30.34 (Semantics). The semantics 63 refers to the meaning of a statement. The 
sentence "I am reading a book." has the meaning that the writer of it is visually obtaining 
information from a set of bound pages filled with written words. 

61 http://en.wikipedia.org/wiki/Language [accessed 2007-07-04] 

62 http://en.wikipedia.org/wiki/Syntax [accessed 2007-07-03] 

63 http://en.wikipedia.org/wiki/Semantics [accessed 2007-07-03] 
30.3.1 Syntax and Formal Languages

Let us now take a closer look at the syntax of formal languages [381, 1166].

Definition 30.35 (Alphabet). A finite set $\Sigma$ of symbols (characters) $a \in \Sigma$ with a total
order (see Section 27.7.2 on page 463) defined on it is called an alphabet.

Definition 30.36 (Character String). A character string 64 (or word) over an alphabet
$\Sigma$ is any finite sequence of symbols $a \in \Sigma$. Character strings have the following properties:

1. The empty character string $\varepsilon$ is a character string over $\Sigma$.

2. If $x$ is a character string over $\Sigma$, then $ax$ is also a character string over $\Sigma$ for all $a \in \Sigma$.

3. $\beta$ is a character string over the alphabet $\Sigma$ if and only if it can be created using the two
rules above.

Definition 30.37 (Concatenation). The concatenation 65 $\alpha \circ \beta$ of two character strings
$\alpha = \alpha_1\alpha_2\alpha_3\ldots\alpha_n$ and $\beta = \beta_1\beta_2\beta_3\ldots\beta_m$ over the alphabet $\Sigma$ is the character string
$\alpha \circ \beta = \alpha_1\alpha_2\alpha_3\ldots\alpha_n\beta_1\beta_2\beta_3\ldots\beta_m$, which begins with $\alpha$ and is immediately followed (and ended) by $\beta$.

The set of all strings of length $l$ over the alphabet $\Sigma$ is called $\Sigma^l$, with $\Sigma^0 = \{\varepsilon\}$ for every $\Sigma$.
The set of all strings over $\Sigma$ is called $\Sigma^*$, i.e., $\Sigma^* = \bigcup_{l=0}^{\infty} \Sigma^l$. It is also called the Kleene star 66
(or Kleene closure).
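The sets defined above can be enumerated mechanically. The following minimal sketch assumes that symbols are single characters; the function names `strings_of_length` and `kleene_star` are illustrative only and do not stem from the text. Since the Kleene star of a non-empty alphabet is infinite, it is produced lazily, ordered by string length.

```python
from itertools import count, islice, product


def strings_of_length(alphabet, length):
    """Yield every string of exactly `length` symbols over `alphabet`."""
    # Sorting makes use of the total order defined on the alphabet.
    for symbols in product(sorted(alphabet), repeat=length):
        yield "".join(symbols)


def kleene_star(alphabet):
    """Lazily yield all strings over `alphabet`, shortest strings first."""
    for length in count(0):
        yield from strings_of_length(alphabet, length)


sigma = {"a", "b"}
print(list(strings_of_length(sigma, 0)))    # only the empty string
print(list(strings_of_length(sigma, 2)))    # all strings of length 2
print(list(islice(kleene_star(sigma), 7)))  # the first seven strings of the closure
```

Generating the closure lazily is a design choice of this sketch: it allows a caller to take as many strings as needed without ever materializing an infinite set.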

Definition 30.38 (Lexeme). A lexeme 67 is the lowest level of syntactical unit of a language 
[381]. It denotes a set of words that have the same meaning, like run, runs, ran, and running 
in English. A lexeme belongs to a particular syntactical category and has a semantic meaning. 

Based on these definitions, we can consider a sentence to be a sequence of lexemes which, 
in turn, are character strings over some alphabet. 

Definition 30.39 (Language). A language $L$ over the alphabet $\Sigma$ is a subset of $\Sigma^*$ [1166].
$L$ is the set of all sentences over the alphabet $\Sigma$ that are valid according to its syntax rules
(the grammar) [395].

When describing the formal syntax of a language, there are two possible approaches: 

1. We can define recognizers that determine the structure of a sentence and can decide 
whether it belongs to the language or not. Recognizers are, for instance, used in compilers 
[865]. 

2. A generative grammar can be defined from which all possible sentences of a language 
can be constructed. 



30.3.2 Generative Grammars 

A generative grammar G of a language L is a formal specification that allows us to construct 
every single sentence in L by applying recursive replacement rules. Therefore, we define 
non-terminal symbols (also called variables) which do not occur in the language's text and 
terminal symbols that do. One example of such a grammar is: 

sentence → subject verb object
subject  → Alice | Bob
verb     → writes | reads
object   → cipher-text | plain-text

Listing 30.1: A simple generative grammar. 



64 http://en.wikipedia.org/wiki/Character_string [accessed 2007-07-03]
65 http://en.wikipedia.org/wiki/Concatenation [accessed 2007-07-10]
66 http://en.wikipedia.org/wiki/Kleene_star [accessed 2007-07-03]
67 http://en.wikipedia.org/wiki/Lexeme [accessed 2007-07-03]
Here we have four productions, the six terminal symbols Alice, Bob, writes, reads,
cipher-text, and plain-text, plus four non-terminal symbols (sentence, subject, verb, and
object).
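To illustrate how such a grammar constructs sentences, the sketch below enumerates everything that can be derived from the grammar of Listing 30.1. The dictionary `PRODUCTIONS` and the function `expand` are hypothetical names chosen for this illustration. The approach only terminates because this particular grammar is not recursive.

```python
from itertools import product

# Each non-terminal symbol maps to its alternatives; every alternative
# is a sequence of grammar symbols.
PRODUCTIONS = {
    "sentence": [["subject", "verb", "object"]],
    "subject": [["Alice"], ["Bob"]],
    "verb": [["writes"], ["reads"]],
    "object": [["cipher-text"], ["plain-text"]],
}


def expand(symbol):
    """Return all sequences of terminal symbols derivable from `symbol`."""
    if symbol not in PRODUCTIONS:  # terminal symbol: nothing left to replace
        return [[symbol]]
    results = []
    for alternative in PRODUCTIONS[symbol]:
        # Combine every expansion of every symbol of the alternative.
        for parts in product(*(expand(s) for s in alternative)):
            results.append([word for part in parts for word in part])
    return results


sentences = [" ".join(words) for words in expand("sentence")]
for sentence in sentences:
    print(sentence)
```

With two alternatives for each of the three parts of a sentence, the enumeration yields 2 * 2 * 2 = 8 sentences.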

Definition 30.40 (Formal Grammar). A formal grammar 68 $G = (N, \Sigma, P, S)$ is a
4-tuple consisting of:

1. a finite set $N$ of non-terminal symbols (variables),

2. the alphabet $\Sigma$, a finite set of terminal symbols,

3. a finite set $P$ of productions (also called rules), and

4. at least one start symbol $S \in N$ which belongs to the set of non-terminal symbols $N$.

Notice that the alphabet $\Sigma$ here is not limited to letters or numerals, but may contain words,
sentences, or even arbitrarily long texts. Additionally, we call the set $V = N \cup \Sigma$, which includes
the terminal and the non-terminal symbols, the grammar symbols.

The Chomsky Hierarchy 

The Chomsky hierarchy is a hierarchy of classes of formal grammars that generate formal
languages. It was first described by the linguist Chomsky in 1956 [394, 396, 1175] and
distinguishes four different classes of grammars. Starting with an unrestricted grammar (type-0),
more and more restrictions are imposed on the allowed production rules. Hence, each
type fully contains all grammar types with a higher type number.



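Grammar | Allowed Rules | Languages
Type-0  | $\alpha \to \beta$; $\alpha, \beta \in V^*$, $\alpha \neq \varepsilon$ | recursively enumerable
Type-1  | $\alpha A \beta \to \alpha \gamma \beta$; $A \in N$, $\alpha, \beta, \gamma \in V^*$, $\gamma \neq \varepsilon$ | context-sensitive (CSG)
Type-2  | $A \to \gamma$; $A \in N$, $\gamma \in V^*$ | context-free (CFG)
Type-3  | $A \to aB$ (right-regular) or $A \to Ba$ (left-regular), $A \to a$; $A, B \in N$, $a \in \Sigma$ | regular

Table 30.2: The Chomsky Hierarchy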



Table 30.2 illustrates the Chomsky hierarchy. As already mentioned, V is the set con- 
taining all terminal and non-terminal symbols and V* is its Kleene closure. 

30.3.3 Derivation Trees 

A derivation tree 69 is a common way to describe how a sentence in a context-free language
can be derived from the start symbol of a given generative grammar. The inner nodes of a
derivation tree are the non-terminal symbols in $N$, the root is the start symbol $S$, and the
leaves are the terminal symbols from the alphabet $\Sigma$. Each edge constitutes one expansion
according to a production of the grammar.

Assume an example grammar $G = (N, \Sigma, P, S)$ with $N = \{T\}$, $\Sigma = \{1, +, a\}$, $S = T$,
and the productions $P$ as defined below.

T → T + T
T → 1
T → a

Listing 30.2: An example context-free generative grammar G. 

68 http://en.wikipedia.org/wiki/Formal_grammar [accessed 2007-07-03] 

69 http://en.wikipedia.org/wiki/Context-free_grammar#Derivations_and_syntax_trees [accessed 2007-07-16]
With this grammar we can construct the following sentence:

T → T + T
  → T + T + T
  → a + T + T
  → a + 1 + T
  → a + 1 + a

Listing 30.3: An example expansion of G.



[Figure: derivation tree whose inner nodes are non-terminal symbols ($N$) and whose leaves a + 1 + a are terminal symbols ($\Sigma$).]

Figure 30.8: The derivation of the example expansion of the grammar G.



Figure 30.8 illustrates the derivation tree that belongs to this example expansion of the 
example grammar G. 
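The expansion can also be replayed step by step. The sketch below assumes that the three productions of G are numbered 1 to 3 in the order in which they are listed and that the leftmost occurrence of T is rewritten in every step; `derive` is a hypothetical helper name. Each application of a production corresponds to one inner node of the derivation tree together with the edges to its children.

```python
# Productions of the example grammar G, numbered in listing order.
PRODUCTIONS = {
    1: ["T", "+", "T"],
    2: ["1"],
    3: ["a"],
}


def derive(choices, start="T"):
    """Apply the chosen productions, always rewriting the leftmost T."""
    form = [start]
    steps = [" ".join(form)]
    for number in choices:
        position = form.index("T")  # leftmost non-terminal symbol
        form[position:position + 1] = PRODUCTIONS[number]
        steps.append(" ".join(form))
    return steps


steps = derive([1, 1, 3, 2, 3])
for step in steps:
    print(step)
```

The sequence of choices fully determines the derivation tree, which is why such a list of production numbers is a convenient compact encoding of a derivation.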

30.3.4 Backus-Naur Form 

The Backus-Naur form 70 (BNF) is a metasyntax used to express context-free grammars
[109, 1156]. Such Chomsky Type-2 grammars are the theoretical basis of most common
programming languages and data formats, like for example C and XML 71. The BNF allows
specifying production rules in a simple manner that both humans and machines can understand.

In BNF specifications, each rule consists of two parts: a non-terminal symbol on the
left-hand side and an expansion on the right-hand side. Non-terminal symbols are enclosed
in angle brackets and terminal symbols are written plain. For expansions, the BNF provides
two constructs: a sequence of symbols and the alternative, which is denoted with a pipe
character "|".

Beginning with the start symbol S, the example below allows us to generate arbitrary
natural numbers from $\mathbb{N}$. A nonZero is either 1, 2, ..., or 9, and a number may also be
zero. A natural number is either a nonZero number or a natural number with a number at
the end. Notice that fully expanding natural always yields a first digit that is non-zero:
a fully expanded rule cannot contain any variables (non-terminal symbols), so the recursion
can only end with the alternative nonZero at the leftmost position. The start symbol S
expands to natural.

<nonZero> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<number>  ::= 0 | <nonZero>
<natural> ::= <nonZero> | <natural> <number>
<S>       ::= <natural>

Listing 30.4: Natural numbers - a small BNF example. 
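The rules of this BNF example translate almost literally into a recognizer. The following sketch is one possible transcription; the names `NON_ZERO`, `NUMBER`, and `is_natural` are assumptions made for this illustration.

```python
NON_ZERO = set("123456789")   # <nonZero>
NUMBER = NON_ZERO | {"0"}     # <number> ::= 0 | <nonZero>


def is_natural(text):
    """Decide whether `text` can be derived from <natural>."""
    if len(text) == 1:
        # First alternative: <natural> ::= <nonZero>
        return text in NON_ZERO
    # Second alternative: <natural> ::= <natural> <number>
    return len(text) > 1 and text[-1] in NUMBER and is_natural(text[:-1])


for candidate in ["7", "1024", "0", "007"]:
    print(candidate, is_natural(candidate))
```

Because the recursion strips digits from the right, the final check always hits the leftmost digit, which must be a nonZero. This mirrors the observation that a fully expanded natural never starts with a zero.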



70 http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form [accessed 2007-07-03]
71 http://www.w3.org/TR/2006/REC-xml-20060816/ [accessed 2007-07-03]
30.3.5 Extended Backus-Naur Form

The extended Backus-Naur form 72 (EBNF) is an extension of the BNF metasyntax that provides
additional operators and simplifications [722, 1623, 1166]. Unlike in the Backus-Naur form, the
terminal symbols are enclosed in quotation marks and the non-terminal symbols are written
without angle brackets. The items of sequences can now be separated by commas and each
rule ends with a semicolon. The EBNF adds options, which are denoted by square brackets.
The sequence inside such an option may occur either zero times or once in the expanded rule.
Curly brackets define expressions that can be omitted or repeated arbitrarily often during
expansion.

The example below demonstrates the application of these new features by providing
a grammar for natural numbers equivalent to the one shown for the BNF. The rules natural
and natural2 are equivalent. Here we also specify a rule for all integer numbers from $\mathbb{Z}$ by
prefixing a natural number with an optional "-".

nonZero  ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
number   ::= "0" | nonZero ;
natural  ::= nonZero | natural , number ;
natural2 ::= nonZero | nonZero , {number} ;
integer  ::= ["-"], natural | "0" ;
S        ::= integer ;

Listing 30.5: Integer numbers - a small EBNF example.

The ISO norm ISO/IEC 14977 [722] for EBNF defines additional extension mechanisms
which we will not discuss here.

30.3.6 Attribute Grammars

An attribute grammar 73 (AG) is a context-free grammar enriched with attributes, rules, and
conditions for these attributes [1157, 1158, 1160, 1596]. With attributes attached to
non-terminal symbols, it becomes possible to provide context-sensitive information. Attribute
grammars are often used in compilers to check rules that cannot be validated with the means
of mere context-free grammars. With attribute grammars, syntax trees can be translated
directly into intermediate languages or into code for some specific machine.
An attribute grammar $AG = (G, A, R)$ consists of three components:

1. a context-free grammar $G$, where $G = (N, \Sigma, P, S)$ as specified in Definition 30.40 on
page 563,

2. a finite set of attributes $A$ where each attribute $a \in A$ has a set of possible values
$a = \{a_1, a_2, \ldots, a_n\}$, and

3. a set of semantic rules $R$.

To each grammar symbol $X \in V$, a finite set of attributes $A(X) \subseteq A$ is associated. This
set is partitioned into two disjoint subsets, the inherited attributes $I(X) \subseteq A(X)$ and the
synthesized attributes $T(X) \subseteq A(X)$. The value of a synthesized attribute is determined by
the attributes attached to the children of the symbol in the derivation tree it is assigned to.
Inherited attributes get their value from the parent or siblings of the symbols they belong
to. In the original definition by Knuth [1157], this was the other way round but the form
discussed here has prevailed [1596]. The start symbol $S \in N$ and the terminal symbols in $\Sigma$
do not have inherited attributes ($I(S) = \emptyset$, $\forall \sigma \in \Sigma \Rightarrow I(\sigma) = \emptyset$).

A good example for synthesized attributes is given in [21] from where I will borrow. AGs
are most often not used as generative grammars but as guidelines for parsers that read, for
instance, the source code of a programming language.

72 http://en.wikipedia.org/wiki/Ebnf [accessed 2007-07-03]

73 http://en.wikipedia.org/wiki/Attribute_grammar [accessed 2007-07-03]
Let us consider a simple grammar for integer mathematics with the two operators +
and *.

E ::= F "+" E |
      F
F ::= integer "*" F |
      integer

Listing 30.6: A simple context-free grammar. 

For each symbol X in V let X.val be the numeric value associated with it. For terminal 
symbols of the type integer, this is simply the lexeme provided by the lexical analyzer. The 
two other terminal characters + and * have no value assigned. The values of the non-terminal 
symbols E and F should be the results of the expressions defined by them. These attributes 
are computed (synthesized) by the semantic rules from the attributes of their child nodes. 

Production                 Rule
E ::= F "+" E |            E.val = F.val + E2.val
      F                    E.val = F.val
F ::= integer "*" F |      F.val = integer.val * F2.val
      integer              F.val = integer.val

Listing 30.7: A small example for attribute grammars. 




[Figure: derivation tree annotated with the following attribute values.]

E_1.val = F_3.val + E_2.val = 31
F_3.val = integer_7.val * F_4.val = 21
F_4.val = integer_8.val = 3
E_2.val = F_5.val = 10
F_5.val = integer_9.val * F_6.val = 10
F_6.val = integer_10.val = 5
integer_7.val = 7
integer_8.val = 3
integer_9.val = 2
integer_10.val = 5



Figure 30.9: An instantiation of the grammar from Listing 30.7. 



Figure 30.9 illustrates the derivation tree of a sentence of the simple attribute grammar
from Listing 30.7. The non-terminal symbols are sometimes annotated with subscript
numbers (like E2) which have no meaning and only serve for clarity. While Listing 30.7
is an example for the usage of synthesized attributes, symbol tables used in compilers are
instances of inherited attributes.
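The way synthesized attributes travel from the leaves up to the root can be sketched with a small recursive-descent parser in which every parsing function returns the val attribute of the symbol it recognizes. This is only an illustration under the assumption that the lexical analyzer is a simple regular expression; the names `tokenize`, `parse_E`, `parse_F`, and `evaluate` are not taken from the text.

```python
import re


def tokenize(text):
    """Lexical analyzer: split the input into integer lexemes, "+" and "*"."""
    return re.findall(r"\d+|[+*]", text)


def parse_E(tokens, pos=0):
    """E ::= F "+" E | F; returns (E.val, next position)."""
    f_val, pos = parse_F(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        e2_val, pos = parse_E(tokens, pos + 1)
        return f_val + e2_val, pos        # E.val = F.val + E2.val
    return f_val, pos                     # E.val = F.val


def parse_F(tokens, pos):
    """F ::= integer "*" F | integer; returns (F.val, next position)."""
    integer_val = int(tokens[pos])        # integer.val stems from the lexeme
    pos += 1
    if pos < len(tokens) and tokens[pos] == "*":
        f2_val, pos = parse_F(tokens, pos + 1)
        return integer_val * f2_val, pos  # F.val = integer.val * F2.val
    return integer_val, pos               # F.val = integer.val


def evaluate(text):
    """Return the val attribute synthesized for the start symbol E."""
    tokens = tokenize(text)
    value, pos = parse_E(tokens)
    if pos != len(tokens):
        raise ValueError("unexpected token at position %d" % pos)
    return value


print(evaluate("7 * 3 + 2 * 5"))
```

No attribute is ever passed downwards in this sketch; all values are computed from the children of a node, which is exactly what makes them synthesized.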

A special form of attribute grammars, the reflective attribute grammar, is the basis of 
the Gads 2 Genetic Programming system discussed in Section 4.5.7 on page 185. 

L-Attributed Grammars

L-attributed grammars 74 are a class of attribute grammars that can be parsed in one left-to-right
traversal of the abstract syntax tree (see Section 4.1.1 on page 158). Such grammars are
the foundations for many programming languages and allow convenient top-down parsing 75.

74 http://en.wikipedia.org/wiki/L-attributed_grammar [accessed 2007-07-04] 

75 http://en.wikipedia.org/wiki/Top-down_parsing [accessed 2007-07-04] 
An attribute grammar is called S-attributed 76 if it allows only synthesized attributes [452].
Because of this restriction, such grammars can be parsed top-down as well as directly bottom-up 77
and are supported by various tools like Bison 78 and Flex 79.

30.3.7 Extended Attribute Grammars 

Extended Attribute Grammars (EAGs), developed by Watt [2162] and Madsen [1344], are a
form of attribute grammars where the semantic (attribute-concerning) rules are no longer
separated from the syntax productions [1874]. Instead, both are combined in a declarative
form where each non-terminal symbol is accompanied by its attributes listed in a predetermined
order. The new syntax for non-terminal symbols is

<n ↕a ↕b ↕c ...>

Listing 30.8: Syntax of an Extended Attribute Grammar symbol.

Here, $n \in N$ is a non-terminal symbol and a, b, and c are the values of the attributes $\alpha$, $\beta$, and $\gamma$,
defined as expressions over their respective attribute value domains. In an extended attribute
grammar, we can define a set of inherited attributes $I(n)$ and a set of synthesized attributes
$T(n)$ for each non-terminal symbol $n$. In our example Listing 30.8, ↕ therefore has to be
replaced with either ↓, which means that the following attribute is inherited (↓a, $\alpha \in I(n)$),
or ↑, denoting a synthesized attribute (↑a, $\alpha \in T(n.\mathit{parent})$), where n.parent stands for
the parent node of n in the derivation tree. Terminal symbols cannot have attributes. Again,
notice that the identifiers a, b, and c do not denote the attribute names but expressions
that define their values. Attributes in EAGs are solely identified by their position in the
non-terminal symbol specifications.

How this approach works is best understood by again using a simple example borrowed
from [1874]. Assume the grammar $G_1 = (N, \Sigma, P, S)$ with the non-terminal symbols
$N = \{S, X, Y, Z\}$, the alphabet $\Sigma = \{x, y, z, \varepsilon\}$, the productions $P$ as defined below, and the start
symbol S. Additionally, X, Y, and Z are each equipped with one synthesized attribute $v \in \mathbb{N}_0$.

<S>      ::= <X ↑v><Y ↑v><Z ↑v>
<X ↑v+1> ::= <X ↑v>"x"
<Y ↑v+1> ::= <Y ↑v>"y"
<Z ↑v+1> ::= <Z ↑v>"z"
<X ↑0>   ::= ε
<Y ↑0>   ::= ε
<Z ↑0>   ::= ε

Listing 30.9: The small example $G_1$ for Extended Attribute Grammars.

In the listing below, a typical expansion of $G_1$ is illustrated. Since the same attribute v is
attached to all three non-terminals X, Y, and Z, the terminal symbols x, y, and z will always
occur equally often. The context-sensitive grammar specified in Listing 30.9 thus defines
sentences of the form $x^n y^n z^n$.

<S> → <X ↑2><Y ↑2><Z ↑2> → <X ↑1>x<Y ↑2><Z ↑2>
    → <X ↑0>xx<Y ↑2><Z ↑2> → xx<Y ↑2><Z ↑2>
    → xx<Y ↑1>y<Z ↑2> → xx<Y ↑0>yy<Z ↑2>
    → xxyy<Z ↑2> → xxyy<Z ↑1>z → xxyy<Z ↑0>zz
    → xxyyzz

Listing 30.10: A typical expansion of $G_1$.
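Read as a recognizer, the grammar counts how often each terminal symbol occurs and demands that the three counts agree. The sketch below mimics this with one counter per non-terminal symbol; `count_run` and `accepts` are hypothetical names, and the loop is merely a compact stand-in for the recursive productions of X, Y, and Z.

```python
def count_run(text, pos, symbol):
    """Synthesize the attribute v for a run of `symbol` starting at `pos`."""
    v = 0
    while pos < len(text) and text[pos] == symbol:
        v += 1      # corresponds to one application of the rule with v+1
        pos += 1
    return v, pos


def accepts(text):
    """Start rule: the runs of x, y, and z must share the same value v."""
    vx, pos = count_run(text, 0, "x")
    vy, pos = count_run(text, pos, "y")
    vz, pos = count_run(text, pos, "z")
    return pos == len(text) and vx == vy == vz


for candidate in ["xxyyzz", "xyz", "xxyzz", ""]:
    print(repr(candidate), accepts(candidate))
```

The equality test on the three counters is the part that a plain context-free grammar cannot express, which is why the attribute is needed.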



76 http://en.wikipedia.org/wiki/S-attributed_grammar [accessed 2007-07-04]

77 http://en.wikipedia.org/wiki/Bottom-up_parsing [accessed 2007-07-04] 

78 http://en.wikipedia.org/wiki/GNU_Bison [accessed 2007-07-04] 

79 http://en.wikipedia.org/wiki/Flex_lexical_analyser [accessed 2007-07-04] 
Another example for extended attribute grammars, again borrowed from [1874], demonstrates
the specification of binary numbers. We can define a grammar $G_2 = (N, \Sigma, P, S)$ for
all binary numbers. In this grammar, the start symbol S has an attribute containing the
value of the number represented by the generated sentence. Here we need three non-terminal
symbols $N = \{S, T, B\}$ and only two terminal symbols $\Sigma = \{0, 1\}$. The productions $P$ are
specified as follows:

Listing 30.11: An extended attribute grammar $G_2$ for binary numbers.



Figure 30.10 illustrates one possible expansion of the start symbol S with the
extended attribute grammar $G_2$. As you can see, S has attached the (decimal) value 10
corresponding to the (binary) value 1010 of the binary string represented by the generated
sentence.
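How a value such as 10 can be attached to the string 1010 may be sketched as follows. This is not a transcription of the productions of the grammar; it merely assumes one plausible scheme in which every bit inherits its position within the string and synthesizes its weighted value, while the start symbol synthesizes the sum. The names `bit_value` and `binary_value` are hypothetical.

```python
def bit_value(bit, position):
    """A bit inherits its position and synthesizes bit * 2**position."""
    return int(bit) * 2 ** position


def binary_value(bits):
    """Synthesize the value of the whole string from the values of its bits."""
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s")
    width = len(bits)
    # The leftmost bit receives the highest position, the rightmost bit position 0.
    return sum(bit_value(b, width - 1 - i) for i, b in enumerate(bits))


print(binary_value("1010"))
```

In this reading, the position is handed downwards (inherited) and the numeric value is handed upwards (synthesized), so both kinds of attributes cooperate in a single derivation.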



[Figure: derivation tree whose recoverable node labels include <S ↑10>, <T ↓0 ↑10>, <B ↓3 ↑8>, and <B ↓0 ↑0>.]

Figure 30.10: One possible expansion of the example grammar $G_2$.



Extended Attribute Grammars are sufficient to specify the syntax and semantics of many 
programming languages [2162]. 

30.3.8 Adaptive Grammars 

Definition 30.41 (Adaptive Grammar). An Adaptive Grammar 80 $G = (N, \Sigma, P, S)$ is
a formal grammar developed by Shutt [1874] in which the set of non-terminal symbols $N$,
the set of terminal symbols $\Sigma$, and the set of productions $P$ may vary during parsing.

Shutt [1874] furthermore discusses Recursive Adaptive Grammars (RAG), which are a
Turing-complete formalism yet retain the elegance of context-free grammars.

80 http://en.wikipedia.org/wiki/Adaptive_grammar [accessed 2007-07-13] 
30.3.9 Christiansen Grammars

Christiansen [402] introduces an adaptable grammar model that combines Extended Attribute
Grammars with the ability to adapt according to Definition 30.41 [402, 405, 406].

Unfortunately, Christiansen [403] calls his adaptable attribute grammars "generative
grammars" [403, 404], a term which already has another meaning (see Section 30.3.2 on page 562).
We therefore resort to the term "Christiansen Grammars" coined by Shutt [1874], from whom
we will again borrow the examples. As described in [402, 406], a Christiansen Grammar is an
Extended Attribute Grammar where the first attribute of each non-terminal symbol $n \in N$ is
an inherited Christiansen Grammar itself. This attribute is called the language attribute, and the
expansion of the non-terminal symbol it belongs to must be done according to the grammar
represented by it.

<n ↓g ↕a ↕b ...>

The statement X<n ↓g ↕a ...>Z ::= XYZ (with $X, Y, Z \in V$ and $n \in N$) hence only holds
if <n ↓g ↕a ...> ::= Y according to the grammar attribute g.

Let us start with a simple example grammar $G_3 = (N, \Sigma, P, S)$ with the non-terminal
symbols alpha-list and alpha, the Latin alphabet as the set of terminal symbols $\Sigma$,
alpha-list as the start symbol $S$, and the set of productions $P$ as specified below.

<alpha-list ↓g ↑w>     ::= <alpha ↓g ↑w>
<alpha-list ↓g ↑w1∘w2> ::= <alpha ↓g ↑w1><alpha-list ↓g ↑w2>
<alpha ↓g ↑"a">        ::= "a"
...
<alpha ↓g ↑"z">        ::= "z"

Listing 30.12: A Christiansen Grammar creating character strings. 

It clearly generates the character strings over the Latin alphabet. The start symbol has 
two attributes: The inherited Christiansen Grammar g will be handed down to all generated 
symbols. The attribute w on the other hand is synthesized from these symbols and contains 
the character string generated. 

Based on this grammar, which is still a mere EAG in principle, we build the Christiansen
Grammar $G_4 = (N, \Sigma, P, S)$ for a subset of the C (or Java) programming language where
all value assignments are valid:



<program ↓g0>           ::= "{"<decl-list ↓g0 ↑g1>
                            <stmnt-list ↓g1>"}"
<decl-list ↓g ↑g>       ::= ε
<decl-list ↓g0 ↑g2>     ::= <decl ↓g0 ↑g1><decl-list ↓g1 ↑g2>
<decl ↓g ↑g+new-rule>   ::= "int" <alpha-list ↓g ↑w> ";"
      where new-rule    is  <id ↓h> ::= w
<stmnt-list ↓g>         ::= ε
<stmnt-list ↓g>         ::= <stmnt ↓g><stmnt-list ↓g>
<stmnt ↓g>              ::= <id ↓g> "=" <id ↓g> ";"

Listing 30.13: Christiansen Grammar for a simple programming language. 

Whenever the non-terminal symbol decl is expanded, it also adds a new rule to the
grammar. By introducing a new production for the symbol id, the declared variable becomes
available in stmnt, since the grammar is synthesized upwards to the production for program
and then inherited downwards into stmnt-list. A more thorough example of a Christiansen
Grammar in the context of Genetic Programming can be found in Listing 4.7.
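The effect of such a growing grammar can be imitated with a set of rules that is extended by every declaration and consulted by every statement. The sketch below is a strong simplification and not an implementation of Christiansen Grammars: it assumes that all tokens are separated by whitespace and that identifiers consist of lowercase Latin letters. The names `is_alpha_list` and `check_program` are hypothetical.

```python
import string


def is_alpha_list(word):
    """A non-empty string of lowercase Latin letters."""
    return word != "" and all(c in string.ascii_lowercase for c in word)


def check_program(source):
    """Return True if every assignment only uses previously declared names."""
    tokens = source.split()
    if len(tokens) < 2 or tokens[0] != "{" or tokens[-1] != "}":
        return False
    body = tokens[1:-1]
    id_rules = set()  # right-hand sides w of the rules for id added so far
    pos = 0
    # Declaration list: every declaration adds one new rule to the grammar.
    while body[pos:pos + 1] == ["int"]:
        decl = body[pos:pos + 3]
        if len(decl) != 3 or not is_alpha_list(decl[1]) or decl[2] != ";":
            return False
        id_rules.add(decl[1])
        pos += 3
    # Statement list: the inherited rules decide which identifiers exist.
    while pos < len(body):
        stmnt = body[pos:pos + 4]
        if len(stmnt) != 4 or stmnt[1] != "=" or stmnt[3] != ";":
            return False
        if stmnt[0] not in id_rules or stmnt[2] not in id_rules:
            return False
        pos += 4
    return True


print(check_program("{ int a ; int b ; a = b ; }"))
print(check_program("{ int a ; a = b ; }"))
```

The set `id_rules` plays the role of the language attribute: it is built up while the declarations are processed and then handed to the statements unchanged.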



30.3.10 Tree-Adjoining Grammars

Tree-adjoining grammars 81 (TAG, also called tree-adjunct grammars) are another method
for defining formal grammars, developed by Joshi [1072] [1704, 1073]. Unlike BNF and
EBNF, they are based on trees instead of plain strings. The inner nodes of
the (fully expanded) trees correspond to non-terminal symbols and the leaves to terminal
symbols.

81 http://en.wikipedia.org/wiki/Tree-adjoining_grammar [accessed 2007-07-03]




[Figure: a TAG tree whose inner nodes correspond to non-terminal symbols and whose leaves correspond to terminal symbols.]

Figure 30.11: An example TAG tree.



The simple TAG tree illustrated in Figure 30.11 is borrowed from [1073] as well as some 
of the following examples. The tree structure of tree-adjoining grammars has one striking 
advantage compared to the flat rules in context-free grammars: the increased domain of 
locality [1704]. If we process for example an EBNF rule, we can only expand the non- 
terminal symbols at our current "level" of the derivation tree. Below we show that a text 
in an EBNF grammar similar to the one of Figure 30.11 could be resolved step by step. 
The variable VP expanded in line 8 for instance cannot be accessed or modified in line 10 
anymore, although it is clearly part of the sentence construction. 
S  ::= NP , VP ;
NP ::= "John" | "Lyn" ;
VP ::= V, NP ;
V  ::= "likes" ;

VP
V NP
"likes" NP
"likes" "Lyn"

Listing 30.14: Another simple context-free grammar.

The extended domain of locality (EDL) in TAG trees is utilized with the two modification 
operators substitution and adjunction. 

We can substitute a tree $\beta$ into a tree $\alpha$ if there is a non-terminal leaf symbol $v$ in $\alpha$
that has the same label as the root of $\beta$. The stump of $\beta$ then replaces the node $v$ in $\alpha$. In
Figure 30.12 we outline how two trees $\beta_1$ and $\beta_2$ are substituted into a TAG tree $\alpha$ and a
new tree $\alpha'$ is created.

Substitution is equivalent to the non-terminal expansion in BNF. The adjunction operator,
however, adds access to the aforementioned layers which are buried in context-free
grammars. In order to perform an adjunction, the tree $\alpha$ has to include one non-terminal
symbol $v$ at some arbitrary place. The root of the auxiliary tree $\beta$ is also labeled with $v$, and
so is at least one of its leaves. We can now replace the node marked with $v$ in $\alpha$ with the tree $\beta$.
Whatever was attached to $v$ before now replaces the leaf node $v$ in $\beta$. The leaf node $v$ in $\beta$
is often additionally marked with an asterisk (*). Figure 30.13 sketches such a replacement,
with the result that the new sentence $\alpha'$ now contains the word "really".
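Both operators can be sketched on a minimal tree data structure. The class `Node`, the functions `substitute`, `adjoin`, and `find_foot`, and the node label ADV for the adverb are assumptions made for this illustration. For brevity, the operations modify the tree in place, use the first matching node, and do not support adjunction at the root.

```python
class Node:
    def __init__(self, label, children=None, foot=False):
        self.label = label
        self.children = children if children is not None else []
        self.foot = foot  # marks the leaf v* of an auxiliary tree

    def leaves(self):
        if not self.children:
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaves()]


def substitute(alpha, beta):
    """Replace the first childless node labeled like beta's root by beta."""
    for i, child in enumerate(alpha.children):
        if not child.children and child.label == beta.label:
            alpha.children[i] = beta
            return True
        if substitute(child, beta):
            return True
    return False


def find_foot(node):
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(alpha, beta):
    """Insert beta at the first inner node v; v's subtrees move to beta's foot."""
    for i, child in enumerate(alpha.children):
        if child.children and child.label == beta.label:
            foot = find_foot(beta)
            foot.children = child.children  # old content hangs below v*
            foot.foot = False
            alpha.children[i] = beta
            return True
        if adjoin(child, beta):
            return True
    return False


alpha = Node("S", [Node("NP"), Node("VP", [Node("V", [Node("likes")]), Node("NP")])])
substitute(alpha, Node("NP", [Node("John")]))
substitute(alpha, Node("NP", [Node("Lyn")]))
print(" ".join(alpha.leaves()))
beta = Node("VP", [Node("ADV", [Node("really")]), Node("VP", foot=True)])
adjoin(alpha, beta)
print(" ".join(alpha.leaves()))
```

Adjunction changes a node in the middle of the tree without touching the words that were already derived below it, which is something a plain string-rewriting step cannot do.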

[Figure: the trees $\beta_1$ and $\beta_2$ are substituted into the tree $\alpha$, yielding the tree $\alpha'$ for the sentence "John likes Lyn".]

Figure 30.12: An example for the substitution operation.

With adjunction, TAGs are somewhere in between context-sensitive and context-free
grammars. In the definition of a tree-adjoining grammar $G = (N, \Sigma, A, I, S)$, $A$ is the set
of auxiliary trees to be used in the adjunction operations. $I$ is the set of initial trees that
can be substituted into existing trees. The union of $I$ and $A$, $E = I \cup A$, is called the set
of elementary trees and replaces the set of productions $P$ used in Chomsky grammars. $N$
and $\Sigma$ retain their meaning as the sets of non-terminal and terminal symbols, respectively. Trees
with the non-terminal symbol $X \in N$ as root are called X-type trees. $S \in N$ denotes the
start symbol and there must be at least one S-type elementary tree.

Definition 30.42 (Lexicalized Tree-Adjoining Grammars). A lexicalized tree-adjoining
grammar (LTAG) is a tree-adjoining grammar where each elementary tree $t \in E$
contains a terminal symbol $x \in \Sigma$. Although they are more restricted, LTAGs are equivalent
to TAGs.

A discussion on derivation trees of tree-adjoining grammars can be found in Section 4.5.9 
on page 189. 



30.3.11 S-expressions 

S-expressions 82 (where S stands for symbolic), or sexps, are data structures for representing
complex data. They are probably best known for their usage in the Lisp 83 [1377, 1379, 1378]
and Scheme 84 [612] programming languages. Their most common feature is that they are
parenthesized prefix notations (often also known as Polish notation 85).

82 http://en.wikipedia.org/wiki/S-expression [accessed 2007-07-03] 

83 http://en.wikipedia.org/wiki/Lisp_programming_language [accessed 2007-07-03] 

84 http://en.wikipedia.org/wiki/Scheme_%28programming_language%29 [accessed 2007-07-03]

85 http://en.wikipedia.org/wiki/Polish_notation [accessed 2007-07-04] 
Figure 30.13: An example for the adjunction operation.



In 1997, Rivest [1742] handed in a standardization draft for S-expressions to be considered 
for publication as RFC. It was, however, never approved but still is the foundation for many 
other publications and RFCs. 

;; Recursively compute the N-th Fibonacci number, with the values for 0 and 1 both being 1.
(defun fibonacci (N)
  (if (or (zerop N) (= N 1))
      1
      (+ (fibonacci (- N 1)) (fibonacci (- N 2)))))

Listing 30.15: A small Lisp-example: How to compute Fibonacci numbers. 
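Because of their uniform parenthesized structure, S-expressions are easy to read into nested lists. The following reader is a minimal sketch that only knows parentheses and whitespace-separated atoms (no strings, quotes, or comments); the names `tokenize`, `parse`, and `read` are chosen for this illustration.

```python
def tokenize(text):
    """Split the text into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one S-expression from the front of the token list."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    if token == ")":
        raise SyntaxError("unexpected closing parenthesis")
    return token  # an atom


def read(text):
    """Read exactly one S-expression from `text`."""
    tokens = tokenize(text)
    expression = parse(tokens)
    if tokens:
        raise SyntaxError("unexpected trailing tokens")
    return expression


print(read("(+ (fibonacci (- N 1)) (fibonacci (- N 2)))"))
```

The first element of every nested list is the operator and the remaining elements are its arguments, which reflects the prefix notation directly in the data structure.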
GNU Free Documentation License (FDL)



Version 1.2, November 2002 

Copyright (C) 2000,2001,2002 Free Software Foundation, Inc. 
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA 

Everyone is permitted to copy and distribute verbatim copies of this license document, but 
changing it is not allowed. 

A.1 Preamble

The purpose of this License is to make a manual, textbook, or other functional and useful document 
"free" in the sense of freedom: to assure everyone the effective freedom to copy and redistribute 
it, with or without modifying it, either commercially or noncommercially. Secondarily, this License 
preserves for the author and publisher a way to get credit for their work, while not being considered 
responsible for modifications made by others. 

This License is a kind of "copyleft", which means that derivative works of the document must 
themselves be free in the same sense. It complements the GNU General Public License, which is a 
copyleft license designed for free software. 

We have designed this License in order to use it for manuals for free software, because free 
software needs free documentation: a free program should come with manuals providing the same 
freedoms that the software does. But this License is not limited to software manuals; it can be used 
for any textual work, regardless of subject matter or whether it is published as a printed book. We 
recommend this License principally for works whose purpose is instruction or reference. 

A.2 Applicability and Definitions

This License applies to any manual or other work, in any medium, that contains a notice placed by 
the copyright holder saying it can be distributed under the terms of this License. Such a notice grants 
a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated 
herein. The "Document", below, refers to any such manual or work. Any member of the public is 
a licensee, and is addressed as "you". You accept the license if you copy, modify or distribute the 
work in a way requiring permission under copyright law. 

A "Modified Version" of the Document means any work containing the Document or a portion
of it, either copied verbatim, or with modifications and/or translated into another language. 

A "Secondary Section" is a named appendix or a front-matter section of the Document that deals
exclusively with the relationship of the publishers or authors of the Document to the Document's 
overall subject (or to related matters) and contains nothing that could fall directly within that 
overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section 
may not explain any mathematics.) The relationship could be a matter of historical connection 
with the subject or with related matters, or of legal, commercial, philosophical, ethical or political 
position regarding them. 

The "Invariant Sections" are certain Secondary Sections whose titles are designated, as being 
those of Invariant Sections, in the notice that says that the Document is released under this License. 
If a section does not fit the above definition of Secondary then it is not allowed to be designated as
Invariant. The Document may contain zero Invariant Sections. If the Document does not identify 
any Invariant Sections then there are none. 

The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or 
Back-Cover Texts, in the notice that says that the Document is released under this License. A 
Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words. 

A "Transparent" copy of the Document means a machine-readable copy, represented in a for- 
mat whose specification is available to the general public, that is suitable for revising the document 
straightforwardly with generic text editors or (for images composed of pixels) generic paint pro- 
grams or (for drawings) some widely available drawing editor, and that is suitable for input to text 
formatters or for automatic translation to a variety of formats suitable for input to text formatters. 
A copy made in an otherwise Transparent file format whose markup, or absence of markup, has 
been arranged to thwart or discourage subsequent modification by readers is not Transparent. An 
image format is not Transparent if used for any substantial amount of text. A copy that is not
"Transparent" is called "Opaque".

Examples of suitable formats for Transparent copies include plain ASCII without markup, 
Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and 
standard-conforming simple HTML, PostScript or PDF designed for human modification. Exam- 
ples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary 
formats that can be read and edited only by proprietary word processors, SGML or XML for which 
the DTD and/or processing tools are not generally available, and the machine-generated HTML, 
PostScript or PDF produced by some word processors for output purposes only. 

The "Title Page" means, for a printed book, the title page itself, plus such following pages as 
are needed to hold, legibly, the material this License requires to appear in the title page. For works 
in formats which do not have any title page as such, "Title Page" means the text near the most 
prominent appearance of the work's title, preceding the beginning of the body of the text. 

A section "Entitled XYZ" means a named subunit of the Document whose title either is precisely 
XYZ or contains XYZ in parentheses following text that translates XYZ in another language. 
(Here XYZ stands for a specific section name mentioned below, such as "Acknowledgements", 
"Dedications", "Endorsements", or "History".) To "Preserve the Title" of such a section when you 
modify the Document means that it remains a section "Entitled XYZ" according to this definition. 

The Document may include Warranty Disclaimers next to the notice which states that this 
License applies to the Document. These Warranty Disclaimers are considered to be included by 
reference in this License, but only as regards disclaiming warranties: any other implication that 
these Warranty Disclaimers may have is void and has no effect on the meaning of this License. 



A. 3 Verbatim Copying 

You may copy and distribute the Document in any medium, either commercially or noncommercially, 
provided that this License, the copyright notices, and the license notice saying this License applies 
to the Document are reproduced in all copies, and that you add no other conditions whatsoever 
to those of this License. You may not use technical measures to obstruct or control the reading 
or further copying of the copies you make or distribute. However, you may accept compensation 
in exchange for copies. If you distribute a large enough number of copies you must also follow the 
conditions in Section A. 4. 

You may also lend copies, under the same conditions stated above, and you may publicly display 
copies. 



A. 4 Copying in Quantity 

If you publish printed copies (or copies in media that commonly have printed covers) of the Doc- 
ument, numbering more than 100, and the Document's license notice requires Cover Texts, you 
must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover 
Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly 
and legibly identify you as the publisher of these copies. The front cover must present the full title 
with all words of the title equally prominent and visible. You may add other material on the covers 
in addition. Copying with changes limited to the covers, as long as they preserve the title of the 
Document and satisfy these conditions, can be treated as verbatim copying in other respects. 

If the required texts for either cover are too voluminous to fit legibly, you should put the first 
ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent 
pages. 

If you publish or distribute Opaque copies of the Document numbering more than 100, you must 
either include a machine-readable Transparent copy along with each Opaque copy, or state in or 
with each Opaque copy a computer-network location from which the general network-using public 
has access to download using public-standard network protocols a complete Transparent copy of the 
Document, free of added material. If you use the latter option, you must take reasonably prudent 
steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent 
copy will remain thus accessible at the stated location until at least one year after the last time 
you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the 
public. 

It is requested, but not required, that you contact the authors of the Document well before 
redistributing any large number of copies, to give them a chance to provide you with an updated 
version of the Document. 



A. 5 Modifications 

You may copy and distribute a Modified Version of the Document under the conditions of 
Section A. 3 and Section A. 4 above, provided that you release the Modified Version under precisely 
this License, with the Modified Version filling the role of the Document, thus licensing distribution 
and modification of the Modified Version to whoever possesses a copy of it. In addition, you must 
do these things in the Modified Version: 

A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and 
from those of previous versions (which should, if there were any, be listed in the History section 
of the Document). You may use the same title as a previous version if the original publisher of 
that version gives permission. 

B. List on the Title Page, as authors, one or more persons or entities responsible for authorship of 
the modifications in the Modified Version, together with at least five of the principal authors 
of the Document (all of its principal authors, if it has fewer than five), unless they release you 
from this requirement. 

C. State on the Title page the name of the publisher of the Modified Version, as the publisher. 

D. Preserve all the copyright notices of the Document. 

E. Add an appropriate copyright notice for your modifications adjacent to the other copyright 
notices. 

F. Include, immediately after the copyright notices, a license notice giving the public permission to 
use the Modified Version under the terms of this License, in the form shown in the Addendum 
below. 

G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given 
in the Document's license notice. 

H. Include an unaltered copy of this License. 

I. Preserve the section Entitled "History", Preserve its Title, and add to it an item stating at least 
the title, year, new authors, and publisher of the Modified Version as given on the Title Page. 
If there is no section Entitled "History" in the Document, create one stating the title, year, 
authors, and publisher of the Document as given on its Title Page, then add an item describing 
the Modified Version as stated in the previous sentence. 

J. Preserve the network location, if any, given in the Document for public access to a Transparent 
copy of the Document, and likewise the network locations given in the Document for previous 
versions it was based on. These may be placed in the "History" section. You may omit a network 
location for a work that was published at least four years before the Document itself, or if the 
original publisher of the version it refers to gives permission. 

K. For any section Entitled "Acknowledgements" or "Dedications", Preserve the Title of the sec- 
tion, and preserve in the section all the substance and tone of each of the contributor acknowl- 
edgements and/or dedications given therein. 

L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. 
Section numbers or the equivalent are not considered part of the section titles. 

M. Delete any section Entitled "Endorsements". Such a section may not be included in the Modified 
Version. 

N. Do not retitle any existing section to be Entitled "Endorsements" or to conflict in title with 
any Invariant Section. 

O. Preserve any Warranty Disclaimers. 

If the Modified Version includes new front-matter sections or appendices that qualify as Sec- 
ondary Sections and contain no material copied from the Document, you may at your option des- 
ignate some or all of these sections as invariant. To do this, add their titles to the list of Invariant 
Sections in the Modified Version's license notice. These titles must be distinct from any other section 
titles. 

You may add a section Entitled "Endorsements", provided it contains nothing but endorsements 
of your Modified Version by various parties--for example, statements of peer review or that the text 
has been approved by an organization as the authoritative definition of a standard. 

You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 
25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. 
Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through 
arrangements made by) any one entity. If the Document already includes a cover text for the same 
cover, previously added by you or by arrangement made by the same entity you are acting on 
behalf of, you may not add another; but you may replace the old one, on explicit permission from 
the previous publisher that added the old one. 

The author(s) and publisher(s) of the Document do not by this License give permission to use 
their names for publicity for or to assert or imply endorsement of any Modified Version. 



A. 6 Combining Documents 

You may combine the Document with other documents released under this License, under the terms 
defined in Section A. 5 above for modified versions, provided that you include in the combination all 
of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant 
Sections of your combined work in its license notice, and that you preserve all their Warranty 
Disclaimers. 

The combined work need only contain one copy of this License, and multiple identical Invariant 
Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same 
name but different contents, make the title of each such section unique by adding at the end of 
it, in parentheses, the name of the original author or publisher of that section if known, or else a 
unique number. Make the same adjustment to the section titles in the list of Invariant Sections in 
the license notice of the combined work. 

In the combination, you must combine any sections Entitled "History" in the various original 
documents, forming one section Entitled "History"; likewise combine any sections Entitled "Ac- 
knowledgements", and any sections Entitled "Dedications". You must delete all sections Entitled 
"Endorsements." 



A. 7 Collections of Documents 

You may make a collection consisting of the Document and other documents released under this 
License, and replace the individual copies of this License in the various documents with a single 
copy that is included in the collection, provided that you follow the rules of this License for verbatim 
copying of each of the documents in all other respects. 

You may extract a single document from such a collection, and distribute it individually under 
this License, provided you insert a copy of this License into the extracted document, and follow this 
License in all other respects regarding verbatim copying of that document. 



A. 8 Aggregation with Independent Works 

A compilation of the Document or its derivatives with other separate and independent documents 
or works, in or on a volume of a storage or distribution medium, is called an "aggregate" if the 
copyright resulting from the compilation is not used to limit the legal rights of the compilation's 
users beyond what the individual works permit. When the Document is included in an aggregate, 
this License does not apply to the other works in the aggregate which are not themselves derivative 
works of the Document. 

If the Cover Text requirement of Section A. 4 is applicable to these copies of the Document, then 
if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be 
placed on covers that bracket the Document within the aggregate, or the electronic equivalent of 
covers if the Document is in electronic form. Otherwise they must appear on printed covers that 
bracket the whole aggregate. 



A. 9 Translation 

Translation is considered a kind of modification, so you may distribute translations of the Docu- 
ment under the terms of Section A. 5. Replacing Invariant Sections with translations requires special 
permission from their copyright holders, but you may include translations of some or all Invariant 
Sections in addition to the original versions of these Invariant Sections. You may include a trans- 
lation of this License, and all the license notices in the Document, and any Warranty Disclaimers, 
provided that you also include the original English version of this License and the original versions 
of those notices and disclaimers. In case of a disagreement between the translation and the original 
version of this License or a notice or disclaimer, the original version will prevail. 

If a section in the Document is Entitled "Acknowledgements", "Dedications", or "History", the 
requirement (Section A. 5) to Preserve its Title (Section A. 2) will typically require changing the 
actual title. 



A. 10 Termination 

You may not copy, modify, sublicense, or distribute the Document except as expressly provided for 
under this License. Any other attempt to copy, modify, sublicense or distribute the Document is 
void, and will automatically terminate your rights under this License. However, parties who have 
received copies, or rights, from you under this License will not have their licenses terminated so 
long as such parties remain in full compliance. 



A. 11 Future Revisions of this License 

The Free Software Foundation may publish new, revised versions of the GNU Free Documentation 
License from time to time. Such new versions will be similar in spirit to the present version, but 
may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/. 

Each version of the License is given a distinguishing version number. If the Document specifies 
that a particular numbered version of this License "or any later version" applies to it, you have the 
option of following the terms and conditions either of that specified version or of any later version 
that has been published (not as a draft) by the Free Software Foundation. If the Document does 
not specify a version number of this License, you may choose any version ever published (not as a 
draft) by the Free Software Foundation. 

How to use this License for your documents 

To use this License in a document you have written, include a copy of the License in the 
document and put the following copyright and license notices just after the title page: 
Copyright (c) YEAR YOUR NAME. 

Permission is granted to copy, distribute and/or modify this document 
under the terms of the GNU Free Documentation License, Version 1.2 
or any later version published by the Free Software Foundation; 
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. 
A copy of the license is included in the section entitled "GNU 
Free Documentation License". 

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the 
"with...Texts." line with this: 

with the Invariant Sections being LIST THEIR TITLES, with the 
Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST. 

If you have Invariant Sections without Cover Texts, or some other combination of the three, 
merge those two alternatives to suit the situation. 

If your document contains nontrivial examples of program code, we recommend releasing these 
examples in parallel under your choice of free software license, such as the GNU General Public 
License, to permit their use in free software. 



B 

GNU Lesser General Public License 

Version 2.1, February 1999 

Copyright (C) 1991, 1999 Free Software Foundation, Inc. 

51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 

Everyone is permitted to copy and distribute verbatim copies of this license document, but 
changing it is not allowed. 

[This is the first released version of the Lesser GPL. It also counts as the successor of the GNU 
Library Public License, version 2, hence the version number 2.1] 



B.1 Preamble 

The licenses for most software are designed to take away your freedom to share and change it. By 
contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and 
change free software--to make sure the software is free for all its users. 

This license, the Lesser General Public License, applies to some specially designated software 
packages--typically libraries--of the Free Software Foundation and other authors who decide to use 
it. You can use it too, but we suggest you first think carefully about whether this license or the 
ordinary General Public License is the better strategy to use in any particular case, based on the 
explanations below. 

When we speak of free software, we are referring to freedom of use, not price. Our General 
Public Licenses are designed to make sure that you have the freedom to distribute copies of free 
software (and charge for this service if you wish); that you receive source code or can get it if you 
want it; that you can change the software and use pieces of it in new free programs; and that you 
are informed that you can do these things. 

To protect your rights, we need to make restrictions that forbid distributors to deny you these 
rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities 
for you if you distribute copies of the library or if you modify it. 

For example, if you distribute copies of the library, whether gratis or for a fee, you must give 
the recipients all the rights that we gave you. You must make sure that they, too, receive or can 
get the source code. If you link other code with the library, you must provide complete object files 
to the recipients, so that they can relink them with the library after making changes to the library 
and recompiling it. And you must show them these terms so they know their rights. 

We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer 
you this license, which gives you legal permission to copy, distribute and/or modify the library. 

To protect each distributor, we want to make it very clear that there is no warranty for the free 
library. Also, if the library is modified by someone else and passed on, the recipients should know 
that what they have is not the original version, so that the original author's reputation will not be 
affected by problems that might be introduced by others. 

Finally, software patents pose a constant threat to the existence of any free program. We wish 
to make sure that a company cannot effectively restrict the users of a free program by obtaining a 
restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a 
version of the library must be consistent with the full freedom of use specified in this license. 

Most GNU software, including some libraries, is covered by the ordinary GNU General Public 
License. This license, the GNU Lesser General Public License, applies to certain designated libraries, 
and is quite different from the ordinary General Public License. We use this license for certain 
libraries in order to permit linking those libraries into non-free programs. 

When a program is linked with a library, whether statically or using a shared library, the 
combination of the two is legally speaking a combined work, a derivative of the original library. 
The ordinary General Public License therefore permits such linking only if the entire combination 
fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking 
other code with the library. 

We call this license the "Lesser" General Public License because it does Less to protect the user's 
freedom than the ordinary General Public License. It also provides other free software developers 
Less of an advantage over competing non-free programs. These disadvantages are the reason we 
use the ordinary General Public License for many libraries. However, the Lesser license provides 
advantages in certain special circumstances. 

For example, on rare occasions, there may be a special need to encourage the widest possible 
use of a certain library, so that it becomes a de-facto standard. To achieve this, non-free programs 
must be allowed to use the library. A more frequent case is that a free library does the same job as 
widely used non-free libraries. In this case, there is little to gain by limiting the free library to free 
software only, so we use the Lesser General Public License. 

In other cases, permission to use a particular library in non-free programs enables a greater 
number of people to use a large body of free software. For example, permission to use the GNU C 
Library in non-free programs enables many more people to use the whole GNU operating system, 
as well as its variant, the GNU/Linux operating system. 

Although the Lesser General Public License is Less protective of the users' freedom, it does en- 
sure that the user of a program that is linked with the Library has the freedom and the wherewithal 
to run that program using a modified version of the Library. 

The precise terms and conditions for copying, distribution and modification follow. Pay close 
attention to the difference between a "work based on the library" and a "work that uses the library". 
The former contains code derived from the library, whereas the latter must be combined with the 
library in order to run. 

B.2 Terms and Conditions for Copying, Distribution and 
Modification 

1. This License Agreement applies to any software library or other program which contains a 
notice placed by the copyright holder or other authorized party saying it may be distributed 
under the terms of this Lesser General Public License (also called "this License"). Each licensee 
is addressed as "you". 

A "library" means a collection of software functions and/or data prepared so as to be conve- 
niently linked with application programs (which use some of those functions and data) to form 
executables. 

The "Library", below, refers to any such software library or work which has been distributed 
under these terms. A "work based on the Library" means either the Library or any derivative 
work under copyright law: that is to say, a work containing the Library or a portion of it, ei- 
ther verbatim or with modifications and/or translated straightforwardly into another language. 
(Hereinafter, translation is included without limitation in the term "modification".) 

"Source code" for a work means the preferred form of the work for making modifications to 
it. For a library, complete source code means all the source code for all modules it contains, 
plus any associated interface definition files, plus the scripts used to control compilation and 
installation of the library. 

Activities other than copying, distribution and modification are not covered by this License; 
they are outside its scope. The act of running a program using the Library is not restricted, 
and output from such a program is covered only if its contents constitute a work based on the 
Library (independent of the use of the Library in a tool for writing it). Whether that is true 
depends on what the Library does and what the program that uses the Library does. 

2. You may copy and distribute verbatim copies of the Library's complete source code as you 
receive it, in any medium, provided that you conspicuously and appropriately publish on each 
copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices 
that refer to this License and to the absence of any warranty; and distribute a copy of this 
License along with the Library. 

You may charge a fee for the physical act of transferring a copy, and you may at your option 
offer warranty protection in exchange for a fee. 

3. You may modify your copy or copies of the Library or any portion of it, thus forming a work 
based on the Library, and copy and distribute such modifications or work under the terms of 
Section 1 above, provided that you also meet all of these conditions: 

a) The modified work must itself be a software library. 

b) You must cause the files modified to carry prominent notices stating that you changed the 
files and the date of any change. 

c) You must cause the whole of the work to be licensed at no charge to all third parties under 
the terms of this License. 

d) If a facility in the modified Library refers to a function or a table of data to be supplied by 
an application program that uses the facility, other than as an argument passed when the 
facility is invoked, then you must make a good faith effort to ensure that, in the event an 
application does not supply such function or table, the facility still operates, and performs 
whatever part of its purpose remains meaningful. 

(For example, a function in a library to compute square roots has a purpose that is en- 
tirely well-defined independent of the application. Therefore, Subsection 2d requires that 
any application-supplied function or table used by this function must be optional: if the 
application does not supply it, the square root function must still compute square roots.) 

These requirements apply to the modified work as a whole. If identifiable sections of that 
work are not derived from the Library, and can be reasonably considered independent and 
separate works in themselves, then this License, and its terms, do not apply to those sections 
when you distribute them as separate works. But when you distribute the same sections as 
part of a whole which is a work based on the Library, the distribution of the whole must 
be on the terms of this License, whose permissions for other licensees extend to the entire 
whole, and thus to each and every part regardless of who wrote it. 

Thus, it is not the intent of this section to claim rights or contest your rights to work 
written entirely by you; rather, the intent is to exercise the right to control the distribution 
of derivative or collective works based on the Library. 

In addition, mere aggregation of another work not based on the Library with the Library 
(or with a work based on the Library) on a volume of a storage or distribution medium 
does not bring the other work under the scope of this License. 

4. You may opt to apply the terms of the ordinary GNU General Public License instead of this 
License to a given copy of the Library. To do this, you must alter all the notices that refer to this 
License, so that they refer to the ordinary GNU General Public License, version 2, instead of 
to this License. (If a newer version than version 2 of the ordinary GNU General Public License 
has appeared, then you can specify that version instead if you wish.) Do not make any other 
change in these notices. 

Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU 
General Public License applies to all subsequent copies and derivative works made from that 
copy. 

This option is useful when you wish to copy part of the code of the Library into a program that 
is not a library. 

5. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) 
in object code or executable form under the terms of Sections 1 and 2 above provided that 
you accompany it with the complete corresponding machine-readable source code, which must 
be distributed under the terms of Sections 1 and 2 above on a medium customarily used for 
software interchange. 

If distribution of object code is made by offering access to copy from a designated place, then 
offering equivalent access to copy the source code from the same place satisfies the requirement 
to distribute the source code, even though third parties are not compelled to copy the source 
along with the object code. 

6. A program that contains no derivative of any portion of the Library, but is designed to work 
with the Library by being compiled or linked with it, is called a "work that uses the Library". 
Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the 
scope of this License. 

However, linking a "work that uses the Library" with the Library creates an executable that is 
a derivative of the Library (because it contains portions of the Library), rather than a "work 
that uses the library". The executable is therefore covered by this License. Section 6 states 
terms for distribution of such executables. 

When a "work that uses the Library" uses material from a header file that is part of the Library, 
the object code for the work may be a derivative work of the Library even though the source 
code is not. Whether this is true is especially significant if the work can be linked without the 
Library, or if the work is itself a library. The threshold for this to be true is not precisely defined 
by law. 

If such an object file uses only numerical parameters, data structure layouts and accessors, and 
small macros and small inline functions (ten lines or less in length), then the use of the object 
file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing 
this object code plus portions of the Library will still fall under Section 6.) 

Otherwise, if the work is a derivative of the Library, you may distribute the object code for 
the work under the terms of Section 6. Any executables containing that work also fall under 
Section 6, whether or not they are linked directly with the Library itself. 

7. As an exception to the Sections above, you may also combine or link a "work that uses the 
Library" with the Library to produce a work containing portions of the Library, and distribute 
that work under terms of your choice, provided that the terms permit modification of the work 
for the customer's own use and reverse engineering for debugging such modifications. 

You must give prominent notice with each copy of the work that the Library is used in it and 
that the Library and its use are covered by this License. You must supply a copy of this License. 
If the work during execution displays copyright notices, you must include the copyright notice 
for the Library among them, as well as a reference directing the user to the copy of this License. 
Also, you must do one of these things: 

a) Accompany the work with the complete corresponding machine-readable source code for 
the Library including whatever changes were used in the work (which must be distributed 
under Sections 1 and 2 above); and, if the work is an executable linked with the Library, 
with the complete machine-readable "work that uses the Library", as object code and/or 
source code, so that the user can modify the Library and then relink to produce a modified 
executable containing the modified Library. (It is understood that the user who changes 
the contents of definitions files in the Library will not necessarily be able to recompile the 
application to use the modified definitions.) 

b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism 
is one that (1) uses at run time a copy of the library already present on the user's computer 
system, rather than copying library functions into the executable, and (2) will operate 
properly with a modified version of the library, if the user installs one, as long as the 
modified version is interface-compatible with the version that the work was made with. 

c) Accompany the work with a written offer, valid for at least three years, to give the same 
user the materials specified in Subsection 6a, above, for a charge no more than the cost of 
performing this distribution. 

d) If distribution of the work is made by offering access to copy from a designated place, offer 
equivalent access to copy the above specified materials from the same place. 

e) Verify that the user has already received a copy of these materials or that you have already 
sent this user a copy. 

For an executable, the required form of the "work that uses the Library" must include any 
data and utility programs needed for reproducing the executable from it. However, as a special 
exception, the materials to be distributed need not include anything that is normally distributed 
(in either source or binary form) with the major components (compiler, kernel, and so on) of 
the operating system on which the executable runs, unless that component itself accompanies 
the executable. 

It may happen that this requirement contradicts the license restrictions of other proprietary 
libraries that do not normally accompany the operating system. Such a contradiction means 
you cannot use both them and the Library together in an executable that you distribute. 

8. You may place library facilities that are a work based on the Library side-by-side in a single 
library together with other library facilities not covered by this License, and distribute such a 
combined library, provided that the separate distribution of the work based on the Library and 
of the other library facilities is otherwise permitted, and provided that you do these two things: 
a) Accompany the combined library with a copy of the same work based on the Library, 
uncombined with any other library facilities. This must be distributed under the terms of 
the Sections above. 

b) Give prominent notice with the combined library of the fact that part of it is a work based 
on the Library, and explaining where to find the accompanying uncombined form of the 
same work. 

9. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly 
provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or 
distribute the Library is void, and will automatically terminate your rights under this License. 
However, parties who have received copies, or rights, from you under this License will not have 
their licenses terminated so long as such parties remain in full compliance. 

10. You are not required to accept this License, since you have not signed it. However, nothing else 
grants you permission to modify or distribute the Library or its derivative works. These actions 
are prohibited by law if you do not accept this License. Therefore, by modifying or distributing 
the Library (or any work based on the Library), you indicate your acceptance of this License 
to do so, and all its terms and conditions for copying, distributing or modifying the Library or 
works based on it. 

11. Each time you redistribute the Library (or any work based on the Library), the recipient 
automatically receives a license from the original licensor to copy, distribute, link with or modify 
the Library subject to these terms and conditions. You may not impose any further restrictions 
on the recipients' exercise of the rights granted herein. You are not responsible for enforcing 
compliance by third parties with this License. 

12. If, as a consequence of a court judgment or allegation of patent infringement or for any other 
reason (not limited to patent issues), conditions are imposed on you (whether by court order, 
agreement or otherwise) that contradict the conditions of this License, they do not excuse you 
from the conditions of this License. If you cannot distribute so as to satisfy simultaneously 
your obligations under this License and any other pertinent obligations, then as a consequence 
you may not distribute the Library at all. For example, if a patent license would not permit 
royalty-free redistribution of the Library by all those who receive copies directly or indirectly 
through you, then the only way you could satisfy both it and this License would be to refrain 
entirely from distribution of the Library. 

If any portion of this section is held invalid or unenforceable under any particular circumstance, 
the balance of the section is intended to apply, and the section as a whole is intended to apply 
in other circumstances. 

It is not the purpose of this section to induce you to infringe any patents or other property 
right claims or to contest validity of any such claims; this section has the sole purpose of 
protecting the integrity of the free software distribution system which is implemented by public 
license practices. Many people have made generous contributions to the wide range of software 
distributed through that system in reliance on consistent application of that system; it is up 
to the author/donor to decide if he or she is willing to distribute software through any other 
system and a licensee cannot impose that choice. 

This section is intended to make thoroughly clear what is believed to be a consequence of the 
rest of this License. 

13. If the distribution and/or use of the Library is restricted in certain countries either by patents 
or by copyrighted interfaces, the original copyright holder who places the Library under this 
License may add an explicit geographical distribution limitation excluding those countries, so 
that distribution is permitted only in or among countries not thus excluded. In such case, this 
License incorporates the limitation as if written in the body of this License. 

14. The Free Software Foundation may publish revised and/or new versions of the Lesser General 
Public License from time to time. Such new versions will be similar in spirit to the present 
version, but may differ in detail to address new problems or concerns. 

Each version is given a distinguishing version number. If the Library specifies a version number 
of this License which applies to it and "any later version", you have the option of following 
the terms and conditions either of that version or of any later version published by the Free 
Software Foundation. If the Library does not specify a license version number, you may choose 
any version ever published by the Free Software Foundation. 

15. If you wish to incorporate parts of the Library into other free programs whose distribution 
conditions are incompatible with these, write to the author to ask for permission. For software 
which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we 
sometimes make exceptions for this. Our decision will be guided by the two goals of preserving 
the free status of all derivatives of our free software and of promoting the sharing and reuse of 
software generally. 



B.3 No Warranty 

1. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 
FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT 
WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR 
OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY 
KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE 
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE 
LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME 
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 

2. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRIT- 
ING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MOD- 
IFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE 
TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR 
CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE 
THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING 
RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR 
A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN 
IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 
SUCH DAMAGES. 

END OF TERMS AND CONDITIONS 
If you develop a new library, and you want it to be of the greatest possible use to the public, 
we recommend making it free software that everyone can redistribute and change. You can do so 
by permitting redistribution under these terms (or, alternatively, under the terms of the ordinary 
General Public License). 

To apply these terms, attach the following notices to the library. It is safest to attach them to 
the start of each source file to most effectively convey the exclusion of warranty; and each file should 
have at least the "copyright" line and a pointer to where the full notice is found. 

<one line to give the library's name and an idea of what it does.> 
Copyright (C) <year> <name of author> 

This library is free software; you can redistribute it and/or modify it under the terms of the 
GNU Lesser General Public License as published by the Free Software Foundation; either version 
2.1 of the License, or (at your option) any later version. 

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE. See the GNU Lesser General Public License for more details. 

You should have received a copy of the GNU Lesser General Public License along with this 
library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, 
MA 02110-1301 USA 

Also add information on how to contact you by electronic and paper mail. 

You should also get your employer (if you work as a programmer) or your school, if any, to sign 
a "copyright disclaimer" for the library, if necessary. Here is a sample; alter the names: 
Yoyodyne, Inc., hereby disclaims all copyright interest in the library 'Frob' (a library for tweaking 
knobs) written by James Random Hacker. 

<signature of Ty Coon>, 1 April 1990 
Ty Coon, President of Vice 

That's all there is to it! 